Kazuhito Koishida


2026

Graphical User Interface (GUI) grounding is critical for effective GUI agents. Despite recent progress, key challenges remain: 1) existing grounding models and benchmarks are skewed toward web and mobile environments, neglecting desktop interfaces (especially Windows); and 2) grounding capability is assessed using accuracy on a single "best" instruction per UI element. However, users can refer to a UI element in diverse valid ways – via visual attributes, spatial relations, etc. – and a capable grounding model should produce consistent outputs across such variations. Focusing on desktop environments, we introduce the GUI Grounding Sensitivity Benchmark, which investigates model sensitivity to multiple descriptions of the same UI element. We design an automatic pipeline to generate multiple valid instructions per UI element and develop nuanced data validation methods, as even frontier models hallucinate when producing a single instruction. Evaluation of 12 models reveals that they are considerably sensitive to instruction variation and that their performance on existing benchmarks does not reflect their true ability. Building on the insight that a given grounding model struggles more with certain instructions or relations, we introduce the GUI Grounding Diagnosis Agent, which generates challenging instructions using model feedback and iterative refinement. Our agent achieves a high success rate (up to 84%) in generating instructions that fail state-of-the-art GUI grounding models.

2025

Graphical User Interface (GUI) automation relies on accurate GUI grounding. However, obtaining large-scale, high-quality labeled data remains a key challenge, particularly in desktop environments such as the Windows Operating System (OS). Existing datasets primarily focus on structured web-based elements, leaving a gap in real-world GUI interaction data for non-web applications. To address this, we introduce a new framework that leverages LLMs to generate large-scale GUI grounding data, enabling automated and scalable labeling across diverse interfaces. To ensure high accuracy and reliability, we manually validated and refined 5,000 GUI coordinate-instruction pairs, creating WinSpot – the first benchmark specifically designed for GUI grounding tasks in Windows environments. WinSpot provides a high-quality dataset for training and evaluating visual GUI agents, establishing a foundation for future research in GUI automation across diverse and unstructured desktop environments.