Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning

Dongjie Cheng; Yongqi Li; Zhixin Ma; Hongru Cai; Yupeng Hu; Wenjie Wang; Liqiang Nie; Wenjie Li

Abstract

Multimodal Large Language Models (MLLMs) are making significant progress in multimodal reasoning. Early approaches focus on pure text-based reasoning. More recent studies have incorporated multimodal information into the reasoning steps; however, they often follow a single task-specific reasoning pattern, which limits their generalizability across various multimodal tasks. In fact, there are numerous multimodal tasks requiring diverse reasoning skills, such as zooming in on a specific region or marking an object within an image. To address this, we propose unified generative multimodal reasoning, which unifies diverse multimodal reasoning skills by generating intermediate images during the reasoning process. We instantiate this paradigm with Omni-R1, a two-stage SFT+RL framework featuring perception alignment loss and perception reward, thereby enabling functional image generation. Additionally, we introduce Omni-R1-Zero, which eliminates the need for multimodal annotations by bootstrapping step-wise visualizations from text-only reasoning data. Empirical results show that Omni-R1 achieves unified generative reasoning across a wide range of multimodal tasks, and Omni-R1-Zero can match or even surpass Omni-R1 on average, suggesting a promising direction for generative multimodal reasoning. The code and checkpoints are attached for reproducibility and subsequent open release.

Anthology ID: 2026.findings-acl.1937
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 38907–38924
URL: https://aclanthology.org/2026.findings-acl.1937/
Cite (ACL): Dongjie Cheng, Yongqi Li, Zhixin Ma, Hongru Cai, Yupeng Hu, Wenjie Wang, Liqiang Nie, and Wenjie Li. 2026. Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning. In Findings of the Association for Computational Linguistics: ACL 2026, pages 38907–38924, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning (Cheng et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1937.pdf
Checklist: 2026.findings-acl.1937.checklist.pdf