Xiaoxi Jiang

2026

RiT: Rubrics-in-Thinking Reinforcement Learning for Improved Reasoning in Large Language Models
Xiaobin Tian | Shuai Yuan | Muyun Ding | Haonan Chen | Xiaoxi Jiang
Findings of the Association for Computational Linguistics: ACL 2026

Large Reasoning Models (LRMs) benefit from generating intermediate reasoning steps, enabling more reliable and interpretable decision-making. While outcome-based supervision has proven effective for LRMs across diverse tasks, it focuses solely on final answers and cannot guarantee high-quality intermediate reasoning. In contrast, existing process supervision is largely limited to verifiable domains such as mathematics or code, where intermediate steps can be explicitly checked, restricting its applicability to open-ended reasoning tasks. To address these limitations, we propose Rubrics-in-Thinking Reinforcement Learning (RiT), the first framework to introduce thinking-rubric supervision into intermediate reasoning. RiT automatically generates fine-grained rubrics and integrates them into a reward function via gated fusion with outcome-based rewards, guiding models to reason in a coherent and task-aligned manner, improving both intermediate steps and the final response. Experiments on reasoning-intensive and open-ended benchmarks demonstrate that RiT consistently outperforms outcome-only RL baselines.
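The abstract does not specify how the gated fusion is computed. As a rough illustration only, the sketch below shows one possible reading: a weighted rubric score over the reasoning trace is added to the outcome reward only when the outcome reward clears a threshold. The names `RubricItem`, `rubric_score`, `gated_reward`, `gate_threshold`, and `rubric_coeff` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricItem:
    """One fine-grained rubric criterion (hypothetical structure)."""
    description: str
    weight: float
    satisfied: bool  # whether the reasoning trace met this criterion


def rubric_score(items: List[RubricItem]) -> float:
    """Weighted fraction of satisfied rubric items, in [0, 1]."""
    total = sum(item.weight for item in items)
    if total == 0:
        return 0.0
    return sum(item.weight for item in items if item.satisfied) / total


def gated_reward(outcome_reward: float,
                 items: List[RubricItem],
                 gate_threshold: float = 0.5,
                 rubric_coeff: float = 0.5) -> float:
    """Assumed gating rule: the rubric term counts only when the
    outcome reward reaches gate_threshold; otherwise it is ignored."""
    gate = 1.0 if outcome_reward >= gate_threshold else 0.0
    return outcome_reward + gate * rubric_coeff * rubric_score(items)


items = [
    RubricItem("states the key assumption", weight=2.0, satisfied=True),
    RubricItem("checks the edge case", weight=1.0, satisfied=False),
]
reward_correct = gated_reward(1.0, items)  # gate open: rubric term added
reward_wrong = gated_reward(0.0, items)    # gate closed: outcome only
```

Under this assumed rule, a trace cannot earn rubric credit for well-formed reasoning that ends in a wrong answer. Other gate designs (soft gates, multiplicative fusion) are equally consistent with the abstract.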


Hallucination remains a critical bottleneck for large language models (LLMs), undermining their reliability in real-world applications, especially in Retrieval-Augmented Generation (RAG) systems. While existing hallucination detection methods employ LLM-as-a-judge to verify LLM outputs against retrieved evidence, they suffer from inherent *confirmation bias*, where the verifier inadvertently reproduces the errors of the original generation. To address this, we introduce **M**ulti-**A**gent **R**einforced self-**C**heck for **H**allucination (MARCH), a framework that enforces rigorous factual alignment by leveraging deliberate *information asymmetry*. MARCH orchestrates a collaborative pipeline of three specialized agents: a Solver, a Proposer, and a Checker. The Solver generates an initial RAG response, which the Proposer decomposes into claim-level verifiable atomic propositions. Crucially, the Checker validates these propositions against retrieved evidence in isolation, deprived of the Solver’s original output. This well-crafted information asymmetry scheme breaks the cycle of self-confirmation bias. By training this pipeline with multi-agent reinforcement learning (MARL), we enable the agents to co-evolve and optimize factual adherence. Extensive experiments across hallucination benchmarks demonstrate that MARCH substantially reduces hallucination rates. Notably, an 8B-parameter LLM equipped with MARCH achieves performance competitive with powerful closed-source models. MARCH paves a scalable path for factual self-improvement of LLMs through co-evolution. The code is at https://github.com/Qwen-Applications/MARCH.
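The information asymmetry described above can be sketched as a plain data-flow constraint: the Checker receives each proposition and the retrieved evidence, but never the Solver's response. This is a minimal sketch under stated assumptions. In MARCH the three agents are LLMs trained with MARL; here `toy_solver`, `toy_proposer`, and `toy_checker` are hypothetical string-matching stand-ins, and `run_pipeline` is an illustrative name rather than anything from the released code.

```python
from typing import Callable, List, Tuple

Solver = Callable[[str, List[str]], str]    # (question, evidence) -> response
Proposer = Callable[[str], List[str]]       # response -> atomic propositions
Checker = Callable[[str, List[str]], bool]  # (proposition, evidence) -> verdict


def run_pipeline(question: str,
                 evidence: List[str],
                 solver: Solver,
                 proposer: Proposer,
                 checker: Checker) -> Tuple[str, List[Tuple[str, bool]]]:
    """Run Solver -> Proposer -> Checker with the Checker isolated
    from the Solver's full response."""
    response = solver(question, evidence)
    propositions = proposer(response)
    # The Checker's signature has no slot for `response`: it sees only
    # one proposition at a time plus the retrieved evidence.
    verdicts = [checker(p, evidence) for p in propositions]
    return response, list(zip(propositions, verdicts))


# Hypothetical toy agents, standing in for the trained LLMs.
def toy_solver(question: str, evidence: List[str]) -> str:
    return "Paris is the capital of France. Paris has 90 million residents."


def toy_proposer(response: str) -> List[str]:
    return [s.strip().rstrip(".") for s in response.split(". ") if s.strip()]


def toy_checker(proposition: str, evidence: List[str]) -> bool:
    return any(proposition.lower() in doc.lower() for doc in evidence)


evidence = ["Paris is the capital of France and its largest city."]
response, results = run_pipeline(
    "What is the capital of France?", evidence,
    toy_solver, toy_proposer, toy_checker,
)
```

Encoding the isolation in the function signature, rather than relying on prompt instructions, makes it structurally impossible for the toy Checker to echo the Solver's errors.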

Co-authors

Shujie Hu 1

Hao Li 1

Zhuo Li 1

Yu Qin 1

Venues

ACL 1
Findings 1
