Min Yang
2026
Beyond Quantity: Trajectory Diversity Scaling for Code Agents
Guhong Chen | Chenghao Sun | Cheng Fu | Qiyao Wang | Zhihong Huang | ChaoPeng Wei | Guangxu Chen | Feiteng Fang | Ahmadreza Argha | Bing Zhao | Xander Xu | Qi Han | Hamid Alinejad-Rokny | Qiang Qu | Binhua Li | Shiwen Ni | Min Yang | HU Wei | Yongbin Li
Findings of the Association for Computational Linguistics: ACL 2026
As code large language models (LLMs) evolve into tool-interactive agents via the Model Context Protocol (MCP), their generalization is increasingly limited by low-quality synthetic data and the diminishing returns of quantity scaling; moreover, quantity-centric scaling exhibits an early bottleneck that underutilizes trajectory data. We propose TDScaling, a Trajectory Diversity Scaling-based data synthesis framework for code agents that scales performance through diversity rather than raw volume. TDScaling is also more data-efficient: under a fixed training budget, increasing trajectory diversity yields larger gains than adding more trajectories, improving the performance-cost trade-off for agent training. TDScaling integrates four innovations: (1) a Business Cluster mechanism that captures real-service logical dependencies; (2) a Blueprint-driven multi-agent paradigm that enforces trajectory coherence; (3) an adaptive evolution mechanism that steers synthesis toward long-tail scenarios using Domain Entropy, Reasoning Mode Entropy, and Cumulative Action Complexity to prevent mode collapse; and (4) a sandboxed code tool that mitigates catastrophic forgetting of intrinsic coding capabilities. Experiments on general tool-use benchmarks (BFCL, τ²-Bench) and code agent tasks (RebenchT, CodeCI, BIRD) demonstrate a win-win outcome: TDScaling improves both tool-use generalization and inherent coding proficiency. Crucially, we show that trajectory diversity scaling attains a substantially higher performance ceiling than quantity scaling, establishing a resource-efficient paradigm for training robust code agents under data bottlenecks.
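The abstract names Domain Entropy as one of the diversity signals but does not define it. As a rough illustration only, the sketch below assumes it behaves like the Shannon entropy of the empirical distribution of domain labels across trajectories; the function name and the label values are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution.

    Assumption: this stands in for a diversity score such as "Domain
    Entropy"; the paper's actual definition is not given in the abstract.
    """
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical domain labels attached to synthesized trajectories.
collapsed = ["payments"] * 4                           # every trajectory in one domain
balanced = ["payments", "maps", "email", "calendar"]   # evenly spread over four domains

print(shannon_entropy(collapsed))  # 0 bits: no diversity
print(shannon_entropy(balanced))   # 2 bits: uniform over four domains
```

Under this reading, a synthesis loop that tracks such a score could favor new trajectories from under-represented domains, which is one plausible way to push toward long-tail scenarios.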
ToolRM: Towards Agentic Tool-Use Reward Modeling
Renhao Li | Jianhong Tu | Yang Su | Yantao Liu | Fei Huang | Hamid Alinejad-Rokny | Derek F. Wong | Junyang Lin | Min Yang
Findings of the Association for Computational Linguistics: ACL 2026
Reward models (RMs) play a critical role in aligning large language models (LLMs) with human preferences. Yet in the domain of tool learning, the lack of RMs specifically designed for function-calling tasks has limited progress toward more capable agentic AI. We introduce ToolRM, a family of lightweight reward models tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs high-quality pairwise preference data using rule-based scoring and multidimensional sampling. This yields ToolPref-Pairwise-30K, a diverse, balanced, and challenging preference dataset that supports both generative and discriminative reward modeling. We also introduce TRBenchBFCL, a benchmark built on the agent evaluation suite BFCL to evaluate RMs on tool calling tasks. Trained on our constructed data, models from the Qwen3-4B/8B series achieve up to 17.94% higher accuracy, substantially outperforming frontier LLMs and RMs in pairwise reward judgments. Beyond training objectives, generative ToolRM generalizes to broader critique tasks, including Best-of-N sampling and self-correction. Experiments on ACEBench highlight its effectiveness and efficiency, enabling inference-time scaling while reducing output token usage by over 66%. Its support for downstream RL training further validates its practical utility. We release data to facilitate future research.
SrDetection: A Self-Referential Framework for Data Leakage Detection in Code Large Language Models
Shuaimin Li | Liyang Fan | Zeyang Li | Zhuoyue Wan | Yufang Lin | Shiwen Ni | Feiteng Fang | Hamid Alinejad-Rokny | Yuanfeng Song | Kun Jing | Chen Jason Zhang | Min Yang
Findings of the Association for Computational Linguistics: ACL 2026
Evaluating code large language models (Code LLMs) requires reliable detection of data leakage, where benchmark performance is artificially inflated by exposure to benchmark data during pre-training. Existing approaches either assume access to proprietary training corpora, rely on brittle heuristics such as timestamp filtering, or use external reference sets with manually tuned, non-generalizable thresholds. To address these limitations, we introduce SrDetection, a unified self-referential leakage detection framework for both gray-box (access to model logits) and black-box (access to model outputs) settings. SrDetection generates semantically equivalent variants of a benchmark sample and detects leakage by contrasting the model’s behavior on the original versus its variants, flagging cases where the original is disproportionately easier for the model. We further design a controlled leakage detection testbed and evaluate SrDetection in this environment. Across different models and training stages, SrDetection improves average F1 by 21.52 points in the gray-box setting and 14.46 points in the black-box setting over strong baselines, demonstrating robust, threshold-independent leakage detection. Finally, a gray-box study of 15 widely used Code LLMs on four popular benchmarks reveals benchmark-specific leakage patterns beyond prior overlap-based analyses. Source code and data are available at https://github.com/SMinL/SrDetectionCode.
CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs
Siyi Li | Jiajun Shi | Shiwen Ni | Ge Zhang | Shuaimin Li | Shijian Wang | Zhoufutu Wen | Yizhi Li | Hamid Alinejad-Rokny | Jiaheng Liu | Min Yang | Wenhao Huang
Findings of the Association for Computational Linguistics: ACL 2026
Large Reasoning Models (LRMs) have demonstrated strong performance by producing extended Chain-of-Thought (CoT) traces before answering. However, this paradigm often induces over-reasoning: redundant calculations and circular self-verification that increase computational cost without improving outcomes. Existing evaluations largely emphasize final accuracy or coarse token counts, and lack automated tools to separate essential logic from structural redundancy. We introduce CoTJudger, a graph-driven framework that quantifies reasoning efficiency by converting free-form CoTs into directed dependency graphs and extracting the Shortest Effective Path (SEP) needed to reach a correct solution. This yields an interpretable efficiency signal – how much of a CoT is necessary versus structurally redundant – that is comparable across models and tasks. Evaluating 21 LRMs, CoTJudger reveals pervasive redundancy and surfaces recurring failure modes, including verification obsession and compensatory redundancy. These results provide a practical metric for disentangling reasoning ability from computational waste, enabling more targeted evaluation and diagnosis of LRM efficiency.
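To make the graph view concrete, here is a minimal sketch of extracting a shortest path from a step-dependency graph and comparing its length with the total number of steps. It assumes an unweighted directed graph and plain breadth-first search; the abstract does not say how the Shortest Effective Path is actually computed, and the step names and the ratio below are hypothetical.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first shortest path in an unweighted directed graph.

    Returns the list of nodes from start to goal, or None if unreachable.
    """
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical CoT steps: an edge points from a step to the steps that build on it.
cot_graph = {
    "problem": ["setup", "restate"],
    "restate": ["setup"],
    "setup": ["compute"],
    "compute": ["verify", "answer"],
    "verify": ["reverify"],
    "reverify": ["answer"],
}

path = shortest_path(cot_graph, "problem", "answer")
all_steps = set(cot_graph) | {s for targets in cot_graph.values() for s in targets}
efficiency = len(path) / len(all_steps)  # share of steps on the shortest path

print(path)                  # ['problem', 'setup', 'compute', 'answer']
print(round(efficiency, 3))  # 0.571
```

In this toy trace, the restatement and the two verification steps lie off the shortest path, which is the kind of structural redundancy the abstract describes.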
MeasHalu: Mitigation of Scientific Measurement Hallucinations for Large Language Models with Enhanced Reasoning
Ruijun Huang | Zhiqiao Kang | Yuxuan Zhu | Junxiong Li | Jiahao Zhao | Minghuan Tan | Feng Jiang | Min Yang
Findings of the Association for Computational Linguistics: ACL 2026
The accurate extraction of scientific measurements from literature is a critical yet challenging task in AI4Science, enabling large-scale analysis and integration of quantitative research findings. However, Large Language Models (LLMs) frequently exhibit severe hallucinations, which significantly undermine the reliability of automated scientific document understanding systems. To address this problem, we propose MeasHalu, a novel framework for mitigating scientific measurement hallucinations through enhanced reasoning and targeted optimization. We first present a fine-grained taxonomy of measurement-specific hallucinations, categorizing errors across quantities, units, modifiers, and relations. Our approach incorporates a two-stage reasoning-aware fine-tuning strategy using augmented scientific data and process-based supervision. Furthermore, we introduce a progressive reward curriculum designed to penalize specific hallucination types, significantly improving extraction faithfulness. Experimental results demonstrate that MeasHalu substantially reduces hallucination rates and improves overall accuracy on the MeasEval benchmark. This work provides a targeted solution to a key bottleneck in automated scientific knowledge extraction, facilitating more trustworthy and scalable machine-assisted scientific literature analysis.
Towards IP Intelligence: Benchmarking Large Language Models on Intellectual Property Knowledge and Practice
Qiyao Wang | Guhong Chen | Hongbo Wang | Huaren Liu | Minghui Zhu | Zhifei Qin | Li Linwei | Yilin Yue | Shiqiang Wang | Jiayan Li | Wu Yihang | Ziqiang Liu | Longze Chen | Run Luo | Liyang Fan | Jiaming Li | Lei Zhang | Kan Xu | Hamid Alinejad-Rokny | Chengming Li | Shiwen Ni | Yuan Lin | Min Yang
Findings of the Association for Computational Linguistics: ACL 2026
Intellectual Property (IP) is a highly specialized domain that integrates technical and legal knowledge, making it inherently complex and knowledge-intensive. Recent advancements in LLMs have demonstrated their potential to handle IP tasks, enabling more efficient analysis, understanding, and generation of IP-related content. However, existing datasets and benchmarks focus narrowly on patents or cover limited aspects of the IP field, lacking alignment with real-world scenarios. To bridge this gap, we introduce **IPBench**, the first comprehensive IP task taxonomy and a large-scale bilingual benchmark encompassing **8 IP mechanisms and 20 distinct tasks**, designed to evaluate LLMs in real-world IP practice. We benchmark **19 main LLMs**, ranging from general purpose to domain-specific, including chat-oriented and reasoning-focused models, under zero-shot, few-shot, and chain-of-thought settings. Our results show that even the top-performing model, DeepSeek-V3, achieves only 75.8% accuracy, indicating significant room for improvement. Notably, open-source IP and law-oriented models lag behind closed-source general-purpose models. To foster future research, we publicly release IPBench, and will expand it with additional tasks to better reflect real-world complexities and support model advancements in the IP domain. We provide the data and code in the supplementary materials.
Act-Adaptive Margin: Dynamically Calibrating Reward Models for Subjective Ambiguity
Feiteng Fang | Dingwei Chen | Xiang Huang | Ting-En Lin | Yuchuan Wu | Xiong Liu | Jing Ye | Ziqiang Liu | Haonan Zhang | Liang Zhu | Hamid Alinejad-Rokny | Min Yang | Yongbin Li
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Currently, most reinforcement learning tasks focus on domains like mathematics and programming, where verification is relatively straightforward. However, in subjective tasks such as role-playing, alignment techniques struggle to make progress, primarily because subjective reward modeling using the Bradley-Terry model faces significant challenges when dealing with ambiguous preferences. To improve reward modeling in subjective tasks, this paper proposes AAM (Act-Adaptive Margin), which enhances reward modeling by dynamically calibrating preference margins using the model’s internal parameter knowledge. We design two versions of AAM that efficiently generate contextually-appropriate preference gaps without additional human annotation. This approach fundamentally improves how reward models handle subjective rewards by better integrating generative understanding with preference scoring. To validate AAM’s effectiveness in subjective reward modeling, we conduct evaluations on RewardBench, JudgeBench, and challenging role-playing tasks. Results show that AAM significantly improves subjective reward modeling performance, enhancing Bradley-Terry reward models by 2.95% in general tasks and 4.85% in subjective role-playing tasks. Furthermore, reward models trained with AAM can help downstream alignment tasks achieve better results. Our test results show that applying rewards generated by AAM-Augmented RM to preference learning techniques (e.g., GRPO) achieves state-of-the-art results on CharacterEval and Charm. The code and dataset will be released upon acceptance.
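For readers unfamiliar with margin-based Bradley-Terry training, the sketch below shows the standard pairwise loss with an additive margin, -log σ(r_chosen − r_rejected − m). It is a generic illustration: the abstract does not specify how AAM derives its margin, so `margin` is simply a caller-supplied number here, and the function name is hypothetical.

```python
import math


def bt_margin_loss(r_chosen: float, r_rejected: float, margin: float = 0.0) -> float:
    """Bradley-Terry negative log-likelihood with an additive margin.

    Computes -log(sigmoid(r_chosen - r_rejected - margin)) in a
    numerically stable way. How the margin is chosen is an assumption
    left to the caller; it is not the paper's calibration procedure.
    """
    z = r_chosen - r_rejected - margin
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Same reward gap, different margins: a larger margin demands a wider gap,
# so the loss is higher until the reward model separates the pair further.
print(bt_margin_loss(2.0, 1.0, margin=0.0))
print(bt_margin_loss(2.0, 1.0, margin=1.0))
```

A small margin for an ambiguous pair and a large one for a clear-cut pair would let the same loss treat the two cases differently, which matches the motivation stated in the abstract.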
TInR: Exploring Tool-Internalized Reasoning in Large Language Models
Qiancheng Xu | Yongqi Li | Fan Liu | Hongru Wang | Min Yang | Wenjie Li
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Tool-Integrated Reasoning (TIR) has emerged as a promising direction by extending Large Language Models’ (LLMs) capabilities with external tools during reasoning. Existing TIR methods typically rely on external tool documentation during reasoning. However, this leads to tool mastery difficulty, tool size constraints, and inference inefficiency. To mitigate these issues, we explore Tool-Internalized Reasoning (TInR), which aims to facilitate reasoning with tool knowledge internalized into LLMs. Achieving this goal presents notable requirements, including tool internalization and tool-reasoning coordination. To address them, we propose TInR-U, a tool-internalized reasoning framework for unified reasoning and tool usage. TInR-U is trained through a three-phase pipeline: (1) tool internalization with a bidirectional knowledge alignment strategy; (2) supervised fine-tuning warm-up using high-quality reasoning annotations; and (3) reinforcement learning with TInR-specific rewards. We comprehensively evaluate our method across in-domain and out-of-domain settings. Experimental results show that TInR-U achieves superior performance in both settings, highlighting its effectiveness and efficiency. The code is attached in the supplementary file for review.
Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers
Xin Chen | Feng Jiang | Yiqian Zhang | Hardy Chen | Shuo Yan | Wenya Xie | Min Yang | Shujian Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reasoning-oriented Large Language Models (LLMs) have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fundamentally limited by a blind self-thinking paradigm: performing extensive internal reasoning even when critical information is missing or ambiguous. We propose Proactive Interactive Reasoning (PIR), a new reasoning paradigm that transforms LLMs from passive solvers into proactive inquirers that interleave reasoning with clarification. Unlike existing search- or tool-based frameworks that primarily address knowledge uncertainty by querying external environments, PIR targets premise- and intent-level uncertainty through direct interaction with the user. PIR is implemented via two core components: (1) an uncertainty-aware supervised fine-tuning procedure that equips models with interactive reasoning capability, and (2) a user-simulator-based policy optimization framework driven by a composite reward that aligns model behavior with user intent. Extensive experiments on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baselines, achieving up to 32.70% higher accuracy, a 22.90% higher pass rate, and an improvement of 41.36 BLEU, while cutting reasoning computation and unnecessary interaction turns by nearly half. Further reliability evaluations on factual knowledge, question answering, and missing-premise scenarios confirm the strong generalization and robustness of PIR.
Co-authors
- Hamid Alinejad-Rokny 6
- Shiwen Ni 4
- Feiteng Fang 3
- Guhong Chen 2
- Liyang Fan 2
- Feng Jiang (蒋峰) 2
- Shuaimin Li 2
- Yongbin Li 2
- Ziqiang Liu 2
- Qiyao Wang 2
- Ahmadreza Argha 1
- Dingwei Chen 1
- Guangxu Chen 1
- Hardy Chen 1
- Longze Chen 1
- Xin Chen 1
- Cheng Fu 1
- Qi Han 1
- Fei Huang 1
- Ruijun Huang 1
- Shujian Huang (书剑 黄) 1
- Wenhao Huang 1
- Xiang Huang 1
- Zhihong Huang 1
- Kun Jing 1
- Zhiqiao Kang 1
- Binhua Li 1
- Chengming Li 1
- Jiaming Li 1
- Jiayan Li 1
- Junxiong Li 1
- Renhao Li 1
- Siyi Li 1
- Wenjie Li 1
- Yizhi Li 1
- Yongqi Li 1
- Zeyang Li 1
- Junyang Lin 1
- Ting-En Lin 1
- Yuan Lin 1
- Yufang Lin 1
- Li Linwei 1
- Fan Liu 1
- Huaren Liu 1
- Jiaheng Liu 1
- Xiong Liu 1
- Yantao Liu 1
- Run Luo 1
- Zhifei Qin 1
- Qiang Qu 1
- Jiajun Shi 1
- Yuanfeng Song 1
- Yang Su 1
- Chenghao Sun 1
- Minghuan Tan 1
- Jianhong Tu 1
- Zhuoyue Wan 1
- Hongbo Wang 1
- Hongru Wang 1
- Shijian Wang 1
- Shiqiang Wang 1
- ChaoPeng Wei 1
- HU Wei 1
- Zhoufutu Wen 1
- Derek F. Wong (黄辉) 1
- Yuchuan Wu 1
- Wenya Xie 1
- Kan Xu 1
- Qiancheng Xu 1
- Xander Xu 1
- Shuo Yan 1
- Jing Ye 1
- Wu Yihang 1
- Yilin Yue 1
- Chen Jason Zhang 1
- Ge Zhang 1
- Haonan Zhang 1
- Lei Zhang 1
- Yiqian Zhang 1
- Bing Zhao 1
- Jiahao Zhao 1
- Liang Zhu 1
- Minghui Zhu 1
- Yuxuan Zhu 1