SimpleOCR: Rendering Visual Questions to Teach MLLMs to Read

Yibo Peng; Peng Xia; Ding Zhong; Kaide Zeng; Siwei Han; Yiyang Zhou; Jiaqi Liu; Ruiyi Zhang; Huaxiu Yao

Abstract

Despite the rapid advancements in Multimodal Large Language Models (MLLMs), a critical question regarding their visual grounding mechanism remains unanswered: do these models genuinely read text embedded in images, or do they merely rely on parametric shortcuts in the text prompt? In this work, we diagnose this issue by introducing the Visualized-Question (VQ) setting, where text queries are rendered directly onto images to structurally mandate visual engagement. Our diagnostic experiments on Qwen2.5-VL reveal a startling capability-utilization gap: despite possessing strong OCR capabilities, models suffer a performance degradation of up to 12.7% in the VQ setting, exposing a deep-seated modality laziness. To bridge this gap, we propose SimpleOCR, a plug-and-play training strategy that imposes a structural constraint on the learning process. By transforming training samples into the VQ format with randomized styles, SimpleOCR effectively invalidates text-based shortcuts, compelling the model to activate and optimize its visual text extraction pathways. Empirically, SimpleOCR yields robust gains without architectural modifications. On four representative OOD benchmarks, it surpasses the base model by 5.4% and GRPO based on original images by 2.7%, while exhibiting extreme data efficiency, achieving superior performance with 30x fewer samples (8.5K) than recent RL-based methods. Furthermore, its plug-and-play nature allows seamless integration with advanced RL strategies like NoisyRollout to yield complementary improvements. Code is available at https://github.com/aiming-lab/SimpleOCR.
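
As a rough illustration of the Visualized-Question transformation described in the abstract, the sketch below draws a question into a banner above the image using a randomly sampled style. It is not the authors' implementation: the names `RenderStyle`, `sample_style`, `layout_question`, and `render_visual_question`, the banner placement, the color palettes, and all numeric ranges are assumptions made for illustration. It assumes the Pillow library is installed for the rendering step.

```python
import random
import textwrap
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderStyle:
    """Hypothetical style parameters; the paper's actual style space may differ."""
    font_size: int
    text_color: tuple
    background: tuple
    padding: int


def sample_style(rng):
    """Draw a random rendering style from assumed (illustrative) ranges."""
    palettes = [
        ((0, 0, 0), (255, 255, 255)),        # black on white
        ((255, 255, 255), (0, 0, 0)),        # white on black
        ((20, 20, 120), (250, 245, 220)),    # navy on cream
    ]
    text_color, background = rng.choice(palettes)
    return RenderStyle(
        font_size=rng.randint(16, 32),
        text_color=text_color,
        background=background,
        padding=rng.randint(8, 24),
    )


def layout_question(question, image_width, style):
    """Wrap the question to the image width; return (lines, banner_height)."""
    # Assumption: an average glyph is about 0.6 of the font size wide.
    usable_width = image_width - 2 * style.padding
    chars_per_line = max(1, int(usable_width / (0.6 * style.font_size)))
    lines = textwrap.wrap(question, width=chars_per_line) or [""]
    line_height = int(style.font_size * 1.3)
    banner_height = 2 * style.padding + line_height * len(lines)
    return lines, banner_height


def render_visual_question(image, question, style):
    """Return a new image with the question drawn in a banner above `image`."""
    from PIL import Image, ImageDraw, ImageFont  # Pillow, imported lazily

    lines, banner_height = layout_question(question, image.width, style)
    canvas = Image.new(
        "RGB", (image.width, image.height + banner_height), style.background
    )
    canvas.paste(image.convert("RGB"), (0, banner_height))

    try:
        # Newer Pillow releases accept a size for the default font.
        font = ImageFont.load_default(size=style.font_size)
    except TypeError:
        # Older releases only offer a fixed-size bitmap font.
        font = ImageFont.load_default()

    draw = ImageDraw.Draw(canvas)
    line_height = int(style.font_size * 1.3)
    y = style.padding
    for line in lines:
        draw.text((style.padding, y), line, fill=style.text_color, font=font)
        y += line_height
    return canvas
```

In a training pipeline of the kind the abstract describes, each sample's image would be replaced by the output of `render_visual_question` with a freshly sampled style, so the question is available to the model only through the image.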

Anthology ID: 2026.findings-acl.519
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 10697–10710
URL: https://aclanthology.org/2026.findings-acl.519/
Cite (ACL): Yibo Peng, Peng Xia, Ding Zhong, Kaide Zeng, Siwei Han, Yiyang Zhou, Jiaqi Liu, Ruiyi Zhang, and Huaxiu Yao. 2026. SimpleOCR: Rendering Visual Questions to Teach MLLMs to Read. In Findings of the Association for Computational Linguistics: ACL 2026, pages 10697–10710, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): SimpleOCR: Rendering Visual Questions to Teach MLLMs to Read (Peng et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.519.pdf
Checklist: 2026.findings-acl.519.checklist.pdf