Jingjie Zheng


pdf bib
MUG: Interactive Multimodal Grounding on User Interfaces
Tao Li | Gang Li | Jingjie Zheng | Purple Wang | Yang Li
Findings of the Association for Computational Linguistics: EACL 2024

We present MUG, a novel interactive task for multimodal grounding where a user and an agent work collaboratively on an interface screen. Prior works modeled multimodal UI grounding in one round: the user gives a command and the agent responds to the command. Yet, in a realistic scenario, a user command can be ambiguous when the target action is inherently difficult to articulate in natural language. MUG allows multiple rounds of interactions such that upon seeing the agent responses, the user can give further commands for the agent to refine or even correct its actions. Such interaction is critical for improving grounding performances in real-world use cases. To investigate the problem, we create a new dataset that consists of 77,820 sequences of human user-agent interaction on mobile interfaces in which 20% involves multiple rounds of interactions. To establish benchmark, we experiment with a range of modeling variants and evaluation strategies, including both offline and online evaluation—the online strategy consists of both human evaluation and automatic with simulators. Our experiments show that iterative interaction significantly improves the absolute task completion by 18% over the entire test set and 31% over the challenging split. Our results lay the foundation for further investigation of the problem.


pdf bib
Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements
Yang Li | Gang Li | Luheng He | Jingjie Zheng | Hong Li | Zhiwei Guan
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Natural language descriptions of user interface (UI) elements such as alternative text are crucial for accessibility and language-based interaction in general. Yet, these descriptions are constantly missing in mobile UIs. We propose widget captioning, a novel task for automatically generating language descriptions for UI elements from multimodal input including both the image and the structural representations of user interfaces. We collected a large-scale dataset for widget captioning with crowdsourcing. Our dataset contains 162,860 language phrases created by human workers for annotating 61,285 UI elements across 21,750 unique UI screens. We thoroughly analyze the dataset, and train and evaluate a set of deep model configurations to investigate how each feature modality as well as the choice of learning strategies impact the quality of predicted captions. The task formulation and the dataset as well as our benchmark models contribute a solid basis for this novel multimodal captioning task that connects language and user interfaces.