Shuai Ren
2026
A3: Android Agent Arena for Mobile GUI Agents with Essential-State Procedural Evaluation
Yuxiang Chai | Shunye Tang | Han Xiao | Weifeng Lin | Hanhao Li | Jiayu Zhang | Liang Liu | Pengxiang Zhao | Guangyi Liu | Guozhi Wang | Shuai Ren | Rongduo Han | Haining Zhang | Siyuan Huang | Hongsheng Li
Findings of the Association for Computational Linguistics: ACL 2026
The advancement of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has catalyzed the development of mobile graphical user interface (GUI) AI agents, which are designed to autonomously perform tasks on mobile devices. However, a significant gap persists in mobile GUI agent evaluation: existing benchmarks predominantly rely on either static frame assessments such as AndroidControl or offline static apps such as AndroidWorld, and thus fail to capture agent performance in dynamic, real-world online mobile apps. To address this gap, we present Android Agent Arena (A3), a novel "essential-state" based procedural evaluation system for mobile GUI agents. A3 introduces a benchmark of 100 tasks derived from 20 widely used, dynamic online apps across 20 categories from the Google Play Store, ensuring comprehensive evaluation. A3 also presents a novel "essential-state" based procedural evaluation method that leverages MLLMs as reward models to progressively verify task completion and process achievement. This evaluation approach addresses the limitations of traditional function-based evaluation methods on online dynamic apps. Furthermore, A3 includes a toolkit to streamline Android device interaction, reset online environments and apps, and facilitate data collection from both human and agent demonstrations. The complete A3 system, including the benchmark and tools, will be publicly released to provide a robust foundation for future research and development in mobile GUI agents.
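The idea of progressively verifying a task against an ordered list of essential states can be illustrated with a small sketch. This is not the A3 implementation: the function name `evaluate_progress`, the `judge` callable, and the keyword-matching stand-in for an MLLM reward model are all assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical judge signature: (essential state description, observation) -> reached?
# In a real system this role would be played by an MLLM; here it is any callable.
Judge = Callable[[str, str], bool]


def evaluate_progress(
    essential_states: List[str],
    observations: Sequence[str],
    judge: Judge,
) -> Dict[str, object]:
    """Check essential states in order against a trajectory of observations.

    A state counts as reached only if it is observed at or after the step
    where the previous state was reached, so ordering is enforced.
    """
    reached = 0
    cursor = 0  # earliest observation index the next state may match
    for state in essential_states:
        hit = None
        for t in range(cursor, len(observations)):
            if judge(state, observations[t]):
                hit = t
                break
        if hit is None:
            break  # later states are not checked once one is missing
        reached += 1
        cursor = hit
    total = len(essential_states)
    return {
        "reached": reached,
        "total": total,
        "progress": reached / total if total else 0.0,
        "success": total > 0 and reached == total,
    }


def keyword_judge(state: str, observation: str) -> bool:
    """Toy stand-in for an MLLM judge: case-insensitive substring match."""
    return state.lower() in observation.lower()


result = evaluate_progress(
    ["search page", "results", "item added"],
    ["home", "search page open", "results list", "cart empty"],
    keyword_judge,
)
```

Here `result["reached"]` counts how many essential states were verified in order, which gives partial credit for process achievement even when the final state is never reached.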
LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark
Guangyi Liu | Pengxiang Zhao | Liang Liu | Zhiming Chen | Yuxiang Chai | Yaozhen Liang | WenHao Wang | Siheng Chen | Zhengxi Lu | Shuai Ren | Hao Wang | Shibo He | Yong Liu | Wenchao Meng
Findings of the Association for Computational Linguistics: ACL 2026
Mobile GUI agents show promise in automating tasks but face significant generalization challenges in long-tail scenarios. While learning from few-shot demonstrations is an emerging solution, its progress is hindered by two critical gaps: the lack of a comprehensive benchmark for systematic evaluation on mobile devices, and the absence of a systematic framework designed to learn from demonstrations in this domain. To address these gaps, we introduce LearnGUI, the first comprehensive benchmark designed for studying demonstration-based learning in mobile agents, comprising 2,252 offline and 101 online tasks. We further develop LearnAct, a modular agent framework engineered to systematically extract, retrieve, and leverage knowledge from visual demonstrations. Extensive evaluations across six backbone models validate our approach: LearnAct achieves dramatic improvements for general-purpose models (e.g., Gemini-2.5-Pro: 38.5%→58.9%) and specialized models alike (e.g., UI-TARS-7B-SFT’s online success rate: 18.1%→32.8%), demonstrating consistent gains across model architectures. Our work provides a robust benchmark and a systematic framework, paving the way for more adaptable and practical mobile agents. Our code and data are publicly available at https://lgy0404.github.io/LearnAct/.
SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning
Jichao Wang | Liuyang Bian | Yufeng Zhou | Han Xiao | Yue Pan | Guozhi Wang | Hao Wang | Zhaoxiong Wang | Yafei Wen | Xiaoxin Chen | Shuai Ren | Lingfang Zeng
Findings of the Association for Computational Linguistics: ACL 2026
As Multimodal Large Language Models (MLLMs) mature, GUI agents are evolving from static interactions to complex navigation. While Reinforcement Learning (RL) has emerged as a promising paradigm for training MLLM agents on dynamic GUI tasks, its effective application faces a dilemma. Standard Offline RL often relies on static step-level data, neglecting global trajectory semantics such as task completion and execution quality. Conversely, Online RL captures the long-term dynamics but suffers from high interaction costs and potential environmental instability. To bridge this gap, we propose SOLAR-RL (Semi-Online Long-horizon RL). Instead of relying solely on expensive online interactions, our framework integrates global trajectory insights directly into the offline learning process. Specifically, we reconstruct diverse rollout candidates from static data, detect the first failure point using per-step validity signals, and retroactively assign dense step-level rewards with target-aligned shaping to reflect trajectory-level execution quality, effectively simulating online feedback without interaction costs. Extensive experiments demonstrate that SOLAR-RL significantly improves long-horizon task completion rates and robustness compared to strong baselines, offering a sample-efficient solution for autonomous GUI navigation.
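The two steps named in the abstract, detecting the first failure point from per-step validity signals and retroactively assigning dense step-level rewards, can be sketched roughly as follows. This is not the SOLAR-RL reward: the function names, the reward constants, and the progress-scaled shaping rule are assumptions chosen only to make the idea concrete.

```python
from typing import List, Optional


def first_failure_index(valid: List[bool]) -> Optional[int]:
    """Return the index of the first invalid step, or None if all steps are valid."""
    for i, ok in enumerate(valid):
        if not ok:
            return i
    return None


def assign_step_rewards(
    valid: List[bool],
    step_reward: float = 0.1,       # assumed per-step credit
    success_bonus: float = 1.0,     # assumed bonus for a fully valid rollout
    failure_penalty: float = -1.0,  # assumed penalty at the failure point
) -> List[float]:
    """Turn per-step validity flags into dense per-step rewards.

    Assumed shaping rule (illustrative only):
    - fully valid rollout: every step earns step_reward, and the final
      step additionally earns success_bonus;
    - otherwise: steps before the first failure earn step_reward scaled by
      the fraction of the rollout completed, the failing step earns
      failure_penalty, and all later steps earn 0.
    """
    n = len(valid)
    fail = first_failure_index(valid)
    if fail is None:
        rewards = [step_reward] * n
        if n:
            rewards[-1] += success_bonus
        return rewards
    progress = fail / n  # fraction of steps completed before the failure
    return [step_reward * progress] * fail + [failure_penalty] + [0.0] * (n - fail - 1)


partial = assign_step_rewards([True, True, False, True])
complete = assign_step_rewards([True, True])
```

Under these assumptions, a rollout that fails late earns more credit on its early steps than one that fails immediately, which is one simple way for step-level rewards to reflect trajectory-level quality without further environment interaction.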
2025
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
Yuxiang Chai | Siyuan Huang | Yazhe Niu | Han Xiao | Liang Liu | Guozhi Wang | Dingyu Zhang | Shuai Ren | Hongsheng Li
Findings of the Association for Computational Linguistics: ACL 2025
AI agents have drawn increasing attention, mostly for their ability to perceive environments, understand tasks, and autonomously achieve goals. To advance research on AI agents in mobile scenarios, we introduce the Android Multi-annotation EXpo (AMEX), a comprehensive, large-scale dataset designed for generalist mobile GUI-control agents, which are capable of completing tasks by directly interacting with the graphical user interface (GUI) on mobile devices. AMEX comprises over 104K high-resolution screenshots from popular mobile applications, which are annotated at multiple levels. Unlike existing GUI-related datasets such as Rico and AitW, AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions with stepwise GUI-action chains. We develop this dataset from a more instructive and detailed perspective, complementing the general settings of existing datasets. Additionally, we fine-tune a baseline model, SPHINX Agent, and illustrate the effectiveness of AMEX.
SmartBench: Is Your LLM Truly a Good Chinese Smartphone Assistant?
Xudong Lu | Haohao Gao | Renshou Wu | Shuai Ren | Xiaoxin Chen | Hongsheng Li | Fangyuan Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) have become integral to daily life, especially advancing as intelligent assistants through on-device deployment on smartphones. However, existing LLM evaluation benchmarks predominantly focus on objective tasks like mathematics and coding in English, which do not necessarily reflect the practical use cases of on-device LLMs in real-world mobile scenarios, especially for Chinese users. To address these gaps, we introduce **SmartBench**, the first benchmark designed to evaluate the capabilities of on-device LLMs in Chinese mobile contexts. We analyze functionalities provided by representative smartphone manufacturers and divide them into five categories: text summarization, text Q&A, information extraction, content creation, and notification management, further detailed into 20 specific tasks. For each task, we construct high-quality datasets comprising 50 to 200 question-answer pairs that reflect everyday mobile interactions, and we develop automated evaluation criteria tailored for these tasks. We conduct comprehensive evaluations of on-device LLMs and MLLMs using SmartBench and also assess their performance after quantized deployment on real smartphone NPUs. Our contributions provide a standardized framework for evaluating on-device LLMs in Chinese, promoting further development and optimization in this critical area. Code and data will be available at https://github.com/vivo-ai-lab/SmartBench.
Co-authors
- Yuxiang Chai 3
- Hongsheng Li 3
- Liang Liu (陆亮) 3
- Guozhi Wang 3
- Han Xiao 3
- Xiaoxin Chen (陈晓昕) 2
- Siyuan Huang 2
- Guangyi Liu 2
- Pengxiang Zhao 2
- Liuyang Bian 1
- Siheng Chen 1
- Zhiming Chen 1
- Haohao Gao 1
- Rongduo Han 1
- Shibo He 1
- Fangyuan Li 1
- Hanhao Li 1
- Yaozhen Liang 1
- Weifeng Lin 1
- Yong Liu 1
- Xudong Lu 1
- Zhengxi Lu 1
- Wenchao Meng 1
- Yazhe Niu 1
- Yue Pan 1
- Shunye Tang 1
- Hao Wang 1
- Hao Wang 1
- Jichao Wang 1
- Wenhao Wang 1
- Zhaoxiong Wang 1
- Yafei Wen 1
- Renshou Wu 1
- Lingfang Zeng 1
- Dingyu Zhang 1
- Haining Zhang 1
- Jiayu Zhang 1
- Yufeng Zhou 1