Binghai Wang
2026
MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning
Jiahang Lin | Kai Hu | Binghai Wang | Yuhao Zhou | Zhiheng Xi | Honglin Guo | Shichun Liu | Junzhe Wang | Shihan Dou | Enyu Zhou | Hang Yan | Zhenhua Han | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2026
Conventional Retrieval-Augmented Generation (RAG) systems often struggle with complex multi-hop queries over long documents due to their single-pass retrieval. We introduce **MM-Doc-R1**, a novel framework that employs an agentic, vision-aware workflow to address long document visual question answering through iterative information discovery and synthesis. To incentivize the information-seeking capabilities of our agents, we propose **Similarity-based Policy Optimization (SPO)**, which addresses baseline estimation bias in existing multi-turn reinforcement learning (RL) algorithms such as GRPO. Our core insight is that in multi-turn RL, the more semantically similar two trajectories are, the more accurate their shared baseline estimation becomes. Leveraging this, SPO calculates a more precise baseline by similarity-weighted averaging of rewards across multiple trajectories, unlike GRPO, which inappropriately applies the initial state’s baseline to all intermediate states. This provides a more stable and accurate learning signal for our agents, leading to training performance that surpasses GRPO. Our experiments on the MMLongBench-Doc benchmark show that **MM-Doc-R1** outperforms previous baselines by **10.4%**. Furthermore, **SPO** outperforms **GRPO** by **5.0%** with Qwen3-8B and **6.1%** with Qwen3-4B. These results highlight the effectiveness of our integrated framework and novel training algorithm in advancing the state of the art for complex, long-document visual question answering.
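The similarity-weighted baseline described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the softmax weighting, the `temperature` parameter, and the trajectory-level (rather than per-state) granularity are all assumptions made for illustration, since the abstract only states that rewards are averaged with similarity weights across trajectories.

```python
import math

def similarity_weighted_baselines(rewards, similarity, temperature=1.0):
    """Return one baseline per trajectory.

    Assumption (not from the paper): baseline_i = sum_j w_ij * r_j, where
    w_ij is a softmax over similarity[i][j] / temperature.
    """
    n = len(rewards)
    baselines = []
    for i in range(n):
        logits = [similarity[i][j] / temperature for j in range(n)]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        baselines.append(sum(w * r for w, r in zip(weights, rewards)) / total)
    return baselines

def advantages(rewards, baselines):
    """Advantage of each trajectory relative to its own baseline."""
    return [r - b for r, b in zip(rewards, baselines)]

# Hypothetical group of four trajectories: 0 and 2 are similar, 1 and 3 are similar.
rewards = [1.0, 0.0, 1.0, 0.0]
uniform = [[0.0] * 4 for _ in range(4)]
blocked = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]]

flat = similarity_weighted_baselines(rewards, uniform)
sharp = similarity_weighted_baselines(rewards, blocked, temperature=0.01)
```

With uniform similarities every baseline collapses to the plain group mean, which is the shared baseline that group-based methods start from. With sharply peaked similarities each trajectory is compared mostly against its near neighbours, so its baseline moves toward their rewards.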
Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models
Binghai Wang | Yantao Liu | Yuxuan Liu | Tianyi Tang | Shenzhi Wang | Chang Gao | Chujie Zheng | Yichang Zhang | Le Yu | Shixuan Liu | Tao Gui | Qi Zhang | Xuanjing Huang | Bowen Yu | Fei Huang | Junyang Lin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Generative Reward Models (GenRMs) and LLM-as-a-Judge exhibit deceptive alignment by producing correct judgments for incorrect reasons, as they are trained and evaluated to prioritize Outcome Accuracy, which undermines their ability to generalize during RLHF. We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model’s reasoning process and human judgment. Our evaluation of frontier models reveals that rationale consistency effectively discriminates among state-of-the-art models and detects deceptive alignment, while outcome accuracy falls short in both respects. To close this gap, we introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training. Our training method achieves state-of-the-art performance on RM-Bench (87.1%) and JudgeBench (82%), surpassing outcome-only baselines by an average of 5%. When the trained reward model is used during RLHF, our method improves performance on Arena Hard v2, notably yielding a 7% improvement on creative writing tasks. Further analysis confirms that our method escapes the deceptive alignment trap, effectively reversing the decline in rationale consistency observed in outcome-only training.
2024
Reward Modeling Requires Automatic Adjustment Based on Data Quality
Binghai Wang | Rui Zheng | Lu Chen | Zhiheng Xi | Wei Shen | Yuhao Zhou | Dong Yan | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: EMNLP 2024
In Reinforcement Learning from Human Feedback (RLHF), the reward model plays a crucial role in aligning language model outputs with human values. Each example in the human preference data used to train the reward model consists of a prompt and a response pair, with humans annotating which response better aligns with human value preferences. Due to the complexity and subjectivity of the annotation task, multiple organizations including OpenAI and Anthropic report significant noise in human preference datasets, which makes reward model training unstable and causes it to deviate from human values. We discover that the difference in scores assigned to response pairs by the reward model effectively indicates the quality of data, and that data of varying quality behaves very differently in reward model training. We introduce a method that automatically adjusts reward modeling based on data quality, reducing the impact of noise and making full use of the dataset. Experiments on multiple human preference datasets demonstrate that our method stabilizes reward model training and significantly enhances the alignment performance of RLHF.
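The score difference mentioned in this abstract can be sketched in a few lines. The snippet below only shows how such a margin is computed and how pairs might be ordered by it, next to the standard pairwise ranking loss commonly used for reward models. The function names and the example scores are hypothetical, and how the paper actually adjusts training based on the margin is not stated in the abstract, so nothing here should be read as the authors' method.

```python
import math

def preference_margins(chosen_scores, rejected_scores):
    """Reward-model score of the chosen response minus that of the rejected one."""
    return [c - r for c, r in zip(chosen_scores, rejected_scores)]

def ranking_loss(margin):
    """Standard pairwise ranking loss, -log(sigmoid(margin)), computed stably."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def rank_by_margin(margins):
    """Indices of preference pairs ordered from lowest to highest margin."""
    return sorted(range(len(margins)), key=lambda i: margins[i])

# Hypothetical reward-model scores for three annotated pairs.
chosen = [2.0, 0.5, -1.0]
rejected = [0.0, 0.4, 1.0]

margins = preference_margins(chosen, rejected)
order = rank_by_margin(margins)
losses = [ranking_loss(m) for m in margins]
```

In this toy data the third pair has a strongly negative margin, meaning the reward model disagrees with the annotation; such pairs are candidates for label noise, while pairs with margins near zero are ambiguous.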
Improving Discriminative Capability of Reward Models in RLHF Using Contrastive Learning
Lu Chen | Rui Zheng | Binghai Wang | Senjie Jin | Caishuang Huang | Junjie Ye | Zhihao Zhang | Yuhao Zhou | Zhiheng Xi | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Reinforcement Learning from Human Feedback (RLHF) is a crucial approach to aligning language models with human values and intentions. A fundamental challenge in this method lies in ensuring that the reward model accurately understands and evaluates human preferences. Current methods rely on ranking losses to teach the reward model to assess preferences, but they are susceptible to noisy and ambiguous data and often fail to capture human intentions. To address this issue, we introduce contrastive learning into the reward modeling process. In addition to the supervised ranking loss, we introduce an unsupervised contrastive loss that enables the reward model to fully capture the distinctions in contrastive data. Experimental results demonstrate that the proposed contrastive learning-based reward modeling method effectively enhances the generalization of the reward model, stabilizes the reinforcement learning training process, and improves the final alignment with human preferences.
Co-authors
- Tao Gui 4
- Xuan-Jing Huang (黄萱菁) 4
- Zhiheng Xi 3
- Lu Chen 2
- Qi Zhang 2
- Qi Zhang 2
- Rui Zheng 2
- Yuhao Zhou 2
- Shihan Dou 1
- Chang Gao 1
- Honglin Guo 1
- Zhenhua Han 1
- Kai Hu 1
- Caishuang Huang 1
- Fei Huang 1
- Senjie Jin 1
- Jiahang Lin 1
- Junyang Lin 1
- Shichun Liu 1
- Shixuan Liu (刘世萱) 1
- Yantao Liu 1
- Yuxuan Liu 1
- Wei Shen 1
- Tianyi Tang 1
- Junzhe Wang 1
- Shenzhi Wang 1
- Dong Yan 1
- Hang Yan 1
- Junjie Ye (叶俊杰) 1
- Bowen Yu 1
- Le Yu 1
- Yichang Zhang 1
- Zhihao Zhang 1
- Chujie Zheng 1
- Enyu Zhou 1
- Yuhao Zhou 1