Beichen Zhang

2025

As language models regularly make mistakes when solving math problems, automated identification of errors in the reasoning process becomes increasingly significant for their scalable oversight. In this paper, we introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning. It consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems. Each test case contains a step-by-step solution with error location annotated by human experts. Models are required to identify the earliest step that contains an error, or conclude that all steps are correct. We conduct extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models, where for the latter we prompt general language models to critique each solution step by step. We draw two main observations: (1) Existing PRMs typically fail to generalize to more challenging math problems beyond GSM8K and MATH. They underperform both critic models (i.e., prompted general language models) and our own trained PRM that is straightforwardly fine-tuned on the PRM800K dataset. (2) The best open-source model, QwQ-32B-Preview, has demonstrated the critique capability competitive with the proprietary model GPT-4o, despite that it still lags behind the reasoning-specialized o1-mini. We hope ProcessBench can foster future research in reasoning process assessment, paving the way toward scalable oversight of language models.

pdf bib abs

Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in complex reasoning through long chain-of-thought, yet they struggle with precise computations and algorithmic operations. Integrating computational tools with LRMs remains challenging, particularly in activating and enhancing models’ tool-use capabilities without compromising their reasoning strengths. We address these challenges through START (Self-taught Reasoner with Tools), introducing two key innovations: (1) Hint-infer, a training-free approach that activates LRMs’ latent tool-use capabilities through artificial hints, enabling test-time performance scaling; (2) Hint-RFT, a self-training framework that enables models to learn effective tool utilization through diverse hint patterns and rejection-based data synthesis. Experiments show that START significantly improves state-of-the-art LRMs across challenging benchmarks, including competition-level mathematics (AMC23: 95.0%, AIME24: 75.6%) and graduate-level science questions (GPQA: 64.6%). Our analysis reveals that START not only enhances accuracy but also improves reasoning efficiency through strategic tool utilization, demonstrating broad applicability in complex reasoning scenarios.

pdf bib abs

Process Reward Models (PRMs) aim to identify and mitigate intermediate errors in the reasoning processes in mathematical reasoning of Large Language Models (LLMs).However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies.In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods.Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs.To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge and advocates a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on the mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task.Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives and provides practical guidelines for future research.

2023

pdf bib abs

Although large language models (LLMs) have achieved excellent performance in a variety of evaluation benchmarks, they still struggle in complex reasoning tasks which require specific knowledge and multi-hop reasoning. To improve the reasoning abilities, we propose ChatCoT, a tool-augmented chain-of-thought reasoning framework for chat-based LLMs (e.g., ChatGPT). In ChatCoT, we model the chain-of-thought (CoT) reasoning as multi-turn conversations, to utilize tools in a more natural way through chatting. At each turn, LLMs can either interact with tools or perform the reasoning. Our approach can effectively leverage the multi-turn conversation ability of chat-based LLMs, and integrate the thought chain following and tools manipulation in a unified way. Specially, we initialize the early turns of the conversation by the knowledge about tools, tasks, and reasoning format, and propose an iterative tool-augmented reasoning step to perform step-by-step tool-augmented reasoning. The experiment results on two complex reasoning datasets (MATH and HotpotQA) have shown the effectiveness of ChatCoT on complex reasoning tasks, achieving a 7.9% relative improvement over the state-of-the-art baseline.

2022

pdf bib abs

Debiased Contrastive Learning of Unsupervised Sentence Representations
Kun Zhou | Beichen Zhang | Xin Zhao | Ji-Rong Wen
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recently, contrastive learning has been shown to be effective in improving pre-trained language models (PLM) to derive high-quality sentence representations. It aims to pull close positive examples to enhance the alignment while push apart irrelevant negatives for the uniformity of the whole representation space. However, previous works mostly adopt in-batch negatives or sample from training data at random. Such a way may cause the sampling bias that improper negatives (false negatives and anisotropy representations) are used to learn sentence representations, which will hurt the uniformity of the representation space. To address it, we present a new framework DCLR (Debiased Contrastive Learning of unsupervised sentence Representations) to alleviate the influence of these improper negatives.In DCLR, we design an instance weighting method to punish false negatives and generate noise-based negatives to guarantee the uniformity of the representation space.Experiments on seven semantic textual similarity tasks show that our approach is more effective than competitive baselines. Our code and data are publicly available at the link: bluehttps://github.com/RUCAIBox/DCLR.

pdf bib abs

Think Beyond Words: Exploring Context-Relevant Visual Commonsense for Diverse Dialogue Generation
Yiting Liu | Liang Li | Beichen Zhang | Qingming Huang
Findings of the Association for Computational Linguistics: EMNLP 2022

Commonsense knowledge has been widely considered for building intelligent open-domain dialogue agents, aiming to generate meaningful and diverse responses. Previous works in this field usually lack the ability to effectively obtain and utilize auxiliary commonsense from the external visual world. In this paper, we argue that exploiting logical information in images related to context can be effective to enrich and steer the generation process. In view of this, we propose VICTOR, a context-relevant VIsual Commonsense enhanced dialogue generaTOR for generating coherent and informative responses. To obtain the associated visual commonsense, we devise a novel approach that expands topic words on the knowledge graph and maps them into daily scenarios. During the generation, the model adopts multimodal fusion mechanism to integrate visual and textual information, and adaptively combine their decoding distributions for better response generation. The experimental results on two public datasets show that our proposed method outperforms the latest competitive methods in terms of coherence and diversity.