GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning

Yi Chen; Yuying Ge; Rui Wang; Yixiao Ge; Junhao Cheng; Ying Shan; Xihui Liu

Abstract

Recent reinforcement learning (RL) approaches, such as outcome-supervised GRPO, have advanced reasoning in Large Language Models (LLMs), yet their adaptation to multimodal LLMs (MLLMs) remains underexplored. Progress has been further limited by the lack of evaluation settings that jointly test perception and reasoning under controlled generalization challenges. To enable such analysis, we present **SEED-Bench-R1**, a structured testbed featuring real-world video tasks and hierarchical evaluation across in-distribution, cross-environment, and cross-environment-task scenarios. Our analysis reveals that standard outcome-supervised GRPO often yields "logical incoherence"—achieving correct answers through flawed reasoning—due to its exclusive focus on final-answer rewards and rigid KL penalties. To address this, we propose **GRPO-CARE**, a consistency-aware RL framework that eliminates KL penalties while introducing a two-tiered reward system: a base reward for accuracy and an adaptive bonus for consistency. This bonus, derived from a slowly evolving reference model through group-relative likelihood calibration, rewards reasoning paths that logically support the final answer without requiring expensive process supervision. Experiments on SEED-Bench-R1 show that GRPO-CARE consistently outperforms standard GRPO, achieving a 6.7% gain on the hardest evaluation level and a 24.5% increase in reasoning consistency. Moreover, models trained with GRPO-CARE transfer effectively to diverse video understanding and even language-only reasoning benchmarks, validating its robustness and generality.
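
The two-tiered reward described above can be illustrated with a minimal sketch. The abstract does not give formulas, so every specific here is an assumption made for illustration and not the authors' implementation: the function names (`care_rewards`, `group_advantages`, `ema_update`), the bonus size, the use of the group mean over correct samples as the calibration threshold, and the exponential moving average as the "slowly evolving" reference update. `ref_likelihoods` stands for a hypothetical per-sample score: the reference model's likelihood of the final answer given that sample's reasoning trace.

```python
from statistics import mean, pstdev


def care_rewards(correct, ref_likelihoods, bonus=0.5):
    """Two-tier reward for one group of sampled responses (illustrative only).

    correct:         list of bools, whether each sample's final answer is right.
    ref_likelihoods: hypothetical reference-model likelihood of the answer
                     given each sample's reasoning trace.
    bonus:           assumed size of the consistency bonus.
    """
    if len(correct) != len(ref_likelihoods):
        raise ValueError("correct and ref_likelihoods must have equal length")
    # Tier 1: base reward for answer accuracy.
    base = [1.0 if c else 0.0 for c in correct]
    correct_liks = [p for c, p in zip(correct, ref_likelihoods) if c]
    if not correct_liks:
        return base  # no correct sample, so no bonus can be assigned
    # Tier 2 (assumed calibration): a correct sample earns the bonus when its
    # likelihood is at least the mean over the group's correct samples.
    threshold = mean(correct_liks)
    return [
        b + (bonus if c and p >= threshold else 0.0)
        for b, c, p in zip(base, correct, ref_likelihoods)
    ]


def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantages; no KL penalty term is added."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def ema_update(ref_params, policy_params, decay=0.99):
    """Assumed slow reference update: exponential moving average of weights."""
    return [decay * r + (1.0 - decay) * p for r, p in zip(ref_params, policy_params)]


if __name__ == "__main__":
    rewards = care_rewards([True, True, False, True], [0.9, 0.4, 0.8, 0.7])
    print(rewards)
    print(group_advantages(rewards))
```

In this sketch, a correct answer reached through a trace that the reference model finds weakly supportive still receives the base reward but ranks below consistent correct samples once advantages are normalized within the group.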

Anthology ID: 2026.findings-acl.210
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 4305–4320
URL: https://aclanthology.org/2026.findings-acl.210/
Cite (ACL): Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Junhao Cheng, Ying Shan, and Xihui Liu. 2026. GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning. In Findings of the Association for Computational Linguistics: ACL 2026, pages 4305–4320, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning (Chen et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.210.pdf
Checklist: 2026.findings-acl.210.checklist.pdf