Yibo Peng
2026
SimpleOCR: Rendering Visual Questions to Teach MLLMs to Read
Yibo Peng | Peng Xia | Ding Zhong | Kaide Zeng | Siwei Han | Yiyang Zhou | Jiaqi Liu | Ruiyi Zhang | Huaxiu Yao
Findings of the Association for Computational Linguistics: ACL 2026
Despite the rapid advancements in Multimodal Large Language Models (MLLMs), a critical question regarding their visual grounding mechanism remains unanswered: do these models genuinely read text embedded in images, or do they merely rely on parametric shortcuts in the text prompt? In this work, we diagnose this issue by introducing the Visualized-Question (VQ) setting, where text queries are rendered directly onto images to structurally mandate visual engagement. Our diagnostic experiments on Qwen2.5-VL reveal a startling capability-utilization gap: despite possessing strong OCR capabilities, models suffer a performance degradation of up to 12.7% in the VQ setting, exposing a deep-seated modality laziness. To bridge this gap, we propose SimpleOCR, a plug-and-play training strategy that imposes a structural constraint on the learning process. By transforming training samples into the VQ format with randomized styles, SimpleOCR effectively invalidates text-based shortcuts, compelling the model to activate and optimize its visual text extraction pathways. Empirically, SimpleOCR yields robust gains without architectural modifications. On four representative OOD benchmarks, it surpasses the base model by 5.4% and GRPO based on original images by 2.7%, while exhibiting extreme data efficiency, achieving superior performance with 30x fewer samples (8.5K) than recent RL-based methods. Furthermore, its plug-and-play nature allows seamless integration with advanced RL strategies like NoisyRollout to yield complementary improvements. Code is available at https://github.com/aiming-lab/SimpleOCR.
When "Correct" Is Not Safe: Can We Trust Functionally Correct Patches Generated by Code Agents?
Yibo Peng | James Song | Lei Li | Xinyu Yang | Mihai Christodorescu | Ravi Mangal | Corina S. Pasareanu | Haizhong Zheng | Beidi Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Code agents are increasingly trusted to autonomously fix bugs on platforms such as GitHub, yet their security evaluation focuses almost exclusively on functional correctness. In this paper, we reveal a novel type of threat to real-world code agents: functionally correct yet vulnerable (FCV) patches, which pass all test cases but contain vulnerable code. With our proposed FCV-Attack, we demonstrate that SOTA LLMs (e.g., ChatGPT and Claude) and agent scaffolds (e.g., SWE-agent and OpenHands) are all vulnerable to this FCV threat across 12 agent-model combinations on SWE-Bench; the attack requires only black-box access and a single query to the code agent. For example, for CWE-538 (information exposure vulnerability), the FCV-Attack attains an attack success rate of 40.7% on GPT-5 Mini + OpenHands. Our results reveal an important security threat overlooked by current evaluation paradigms and urge the development of security-aware defenses for code agents.