Jiaqi Liu
2026
SimpleOCR: Rendering Visual Questions to Teach MLLMs to Read
Yibo Peng | Peng Xia | Ding Zhong | Kaide Zeng | Siwei Han | Yiyang Zhou | Jiaqi Liu | Ruiyi Zhang | Huaxiu Yao
Findings of the Association for Computational Linguistics: ACL 2026
Despite the rapid advancements in Multimodal Large Language Models (MLLMs), a critical question regarding their visual grounding mechanism remains unanswered: do these models genuinely read text embedded in images, or do they merely rely on parametric shortcuts in the text prompt? In this work, we diagnose this issue by introducing the Visualized-Question (VQ) setting, where text queries are rendered directly onto images to structurally mandate visual engagement. Our diagnostic experiments on Qwen2.5-VL reveal a startling capability-utilization gap: despite possessing strong OCR capabilities, models suffer a performance degradation of up to 12.7% in the VQ setting, exposing a deep-seated modality laziness. To bridge this gap, we propose SimpleOCR, a plug-and-play training strategy that imposes a structural constraint on the learning process. By transforming training samples into the VQ format with randomized styles, SimpleOCR effectively invalidates text-based shortcuts, compelling the model to activate and optimize its visual text extraction pathways. Empirically, SimpleOCR yields robust gains without architectural modifications. On four representative OOD benchmarks, it surpasses the base model by 5.4% and GRPO based on original images by 2.7%, while exhibiting extreme data efficiency, achieving superior performance with 30x fewer samples (8.5K) than recent RL-based methods. Furthermore, its plug-and-play nature allows seamless integration with advanced RL strategies like NoisyRollout to yield complementary improvements. Code is available at https://github.com/aiming-lab/SimpleOCR.
AgentGym2: Benchmarking Large Language Model Agents in De-Idealized Real-World Environments
Zhiheng Xi | Dingwen Yang | Jiaqi Liu | Jixuan Huang | Honglin Guo | Baodai Huang | Tinggang Chen | Qi Zhang | Zhonghang Lu | Chenyu Liu | Jiajun Sun | Jiazheng Zhang | Dingwei Zhu | Xin Guo | Junzhe Wang | Zhihao Zhang | Yuming Yang | Junjie Ye | Minghe Gao | Dongrui Liu | Jiaming Ji | Guohao Li | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language agents, i.e., LLM agents, are progressing rapidly and are increasingly deployed in production environments. This trend underscores the urgent need for rigorous and realistic evaluations. However, most existing benchmarks evaluate agents in simplified, idealized settings. They typically rely on pre-packaged tool interfaces, overlook critical steps, and assume inputs are clean and fully specified. Consequently, they understate the difficulty of real deployments, where uncertainty and noise are ubiquitous and agents must proactively explore the environment to uncover new tools. To bridge this gap, we present AgentGym2, a new evaluation framework with task instances grounded in real-world end-to-end working demands. Beyond reasoning and planning, it measures agents’ ability to execute end-to-end procedures, discover tools via exploration, compose tools for unseen tasks, and remain robust to noisy and underspecified information. Experiments on 15 proprietary and open-source models show that even SOTA systems like Gemini and GPT-5 struggle on AgentGym2, revealing a substantial gap between the capability of current agents and the demands of real-world applications.
Co-authors
- Tinggang Chen 1
- Minghe Gao 1
- Tao Gui 1
- Honglin Guo 1
- Xin Guo 1
- Siwei Han 1
- Baodai Huang 1
- Jixuan Huang 1
- Xuan-Jing Huang (黄萱菁) 1
- Jiaming Ji 1
- Guohao Li 1
- Chenyu Liu 1
- Dongrui Liu 1
- Zhonghang Lu 1
- Yibo Peng 1
- Jiajun Sun 1
- Junzhe Wang 1
- Zhiheng Xi 1
- Peng Xia 1
- Dingwen Yang 1
- Yuming Yang 1
- Huaxiu Yao 1
- Junjie Ye (叶俊杰) 1
- Kaide Zeng 1
- Jiazheng Zhang 1
- Qi Zhang 1
- Qi Zhang 1
- Ruiyi Zhang 1
- Zhihao Zhang 1
- Ding Zhong 1
- Yiyang Zhou 1
- Dingwei Zhu 1