Caishuang Huang
2026
MulDimIF: A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models
Junjie Ye | Caishuang Huang | Zhuohan Chen | Wenjie Fu | Chenyuan Yang | Leyi Yang | Yilong Wu | Peng Wang | Meng Zhou | Xiaolong Yang | Tao Gui | Qi Zhang | Zhongchao Shi | Jianping Fan | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2026
Junjie Ye | Caishuang Huang | Zhuohan Chen | Wenjie Fu | Chenyuan Yang | Leyi Yang | Yilong Wu | Peng Wang | Meng Zhou | Xiaolong Yang | Tao Gui | Qi Zhang | Zhongchao Shi | Jianping Fan | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2026
Instruction following refers to the ability of large language models (LLMs) to generate outputs that satisfy all specified constraints. Existing research has primarily focused on constraint categories, offering limited evaluation dimensions and little guidance for improving instruction-following abilities. To address this gap, we introduce MulDimIF, a multi-dimensional constraint framework encompassing three constraint patterns, four constraint categories, and four difficulty levels. Based on this framework, we design a controllable instruction generation pipeline. Through constraint expansion, conflict detection, and instruction rewriting, we construct 9,106 code-verifiable samples. We evaluate 18 LLMs from six model families and find marked performance differences across constraint settings. For instance, average accuracy decreases from 80.82% at Level I to 36.76% at Level IV. Moreover, training with data generated by our framework significantly improves instruction following without compromising general performance. In-depth analysis indicates that these gains stem largely from parameter updates in attention modules, which strengthen constraint recognition and adherence. Code and data are available in https://github.com/Junjie-Ye/MulDimIF.
FinToolSyn: A forward synthesis Framework for Financial Tool-Use Dialogue Data with Dynamic Tool Retrieval
Caishuang Huang | Yang Qiao | Rongyu Zhang | Junjie Ye | Pu Lu | Wuwenxi | Meng Zhou | Xiku Du | Qi Zhang | Tao Gui | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2026
Caishuang Huang | Yang Qiao | Rongyu Zhang | Junjie Ye | Pu Lu | Wuwenxi | Meng Zhou | Xiku Du | Qi Zhang | Tao Gui | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2026
Tool-use capabilities are vital for Large Language Models (LLMs) in finance, a domain characterized by massive investment targets and data-intensive inquiries. However, existing data synthesis methods typically rely on a reverse synthesis paradigm, generating user queries from pre-sampled tools. This approach inevitably introduces artificial explicitness, yielding queries that fail to capture the implicit, event-driven nature of real-world needs. Moreover, its reliance on static tool sets overlooks the dynamic retrieval process required to navigate massive tool spaces. To address these challenges, we introduce FinToolSyn, a forward synthesis framework designed to generate high-quality financial dialogues. Progressing from persona instruction and atomic tool synthesis to dynamic retrieval dialogue generation, our pipeline constructs a repository of 43,066 tools and synthesizes over 148k dialogue instances, incorporating dynamic retrieval to emulate the noisy candidate sets typical of massive tool spaces. We also establish a dedicated benchmark to evaluate tool-calling capabilities in realistic financial scenarios. Extensive experiments demonstrate that models trained on FinToolSyn achieve a 21.06% improvement, providing a robust foundation for tool learning in financial scenarios.
VRPO: Rethinking Value Modeling for Robust RL under Noisy Supervision in LLM Post-Training
Dingwei Zhu | Shihan Dou | Zhiheng Xi | Senjie Jin | Guoqiang Zhang | Jiazheng Zhang | Junjie Ye | Mingxu Chai | Enyu Zhou | Ming Zhang | Yuhui Wang | Caishuang Huang | Chenhao Huang | Yunke Zhang | Yuran Wang | Tao Gui | Qi Zhang | Xipeng Qiu | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Dingwei Zhu | Shihan Dou | Zhiheng Xi | Senjie Jin | Guoqiang Zhang | Jiazheng Zhang | Junjie Ye | Mingxu Chai | Enyu Zhou | Ming Zhang | Yuhui Wang | Caishuang Huang | Chenhao Huang | Yunke Zhang | Yuran Wang | Tao Gui | Qi Zhang | Xipeng Qiu | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement Learning (RL) in real-world environments often suffers from ambiguous or incomplete reward supervision, which undermines policy stability and generalization. Such noise may cause models to ignore key information or even collapse in advantage estimation. We find that a strong value model is essential for absorbing unstable signals and producing reliable advantages, offering denser and more robust supervision than the reward model. To better optimize noisy supervision, we propose VRPO, a framework that enhances value modeling for robust RL in LLM post-training. VRPO integrates (1) auxiliary losses guided by entropy and perplexity from a frozen language model, and (2) a variational information bottleneck, enabling the value model to filter noise and capture key words. This design allows the value model to correct noise rewards and generate more reliable advantage estimates, transforming it from a passive predictor into an active noise regulator. Experiments on multi-turn dialogue, math reasoning, and science QA with both rule-based and model-based rewards show that VRPO consistently outperforms baselines such as PPO and GRPO. Our work highlight the central role of the value model in Robust RL and provide a principled and practical approach to policy optimization under noisy supervision.
2025
Beyond Boundaries: Learning a Universal Entity Taxonomy across Datasets and Languages for Open Named Entity Recognition
Yuming Yang | Wantong Zhao | Caishuang Huang | Junjie Ye | Xiao Wang | Huiyuan Zheng | Yang Nan | Yuran Wang | Xueying Xu | Kaixin Huang | Yunke Zhang | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 31st International Conference on Computational Linguistics
Yuming Yang | Wantong Zhao | Caishuang Huang | Junjie Ye | Xiao Wang | Huiyuan Zheng | Yang Nan | Yuran Wang | Xueying Xu | Kaixin Huang | Yunke Zhang | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 31st International Conference on Computational Linguistics
Open Named Entity Recognition (NER), which involves identifying arbitrary types of entities from arbitrary domains, remains challenging for Large Language Models (LLMs). Recent studies suggest that fine-tuning LLMs on extensive NER data can boost their performance. However, training directly on existing datasets neglects their inconsistent entity definitions and redundant data, limiting LLMs to dataset-specific learning and hindering out-of-domain adaptation. To address this, we present B2NERD, a compact dataset designed to guide LLMs’ generalization in Open NER under a universal entity taxonomy. B2NERD is refined from 54 existing English and Chinese datasets using a two-step process. First, we detect inconsistent entity definitions across datasets and clarify them by distinguishable label names to construct a universal taxonomy of 400+ entity types. Second, we address redundancy using a data pruning strategy that selects fewer samples with greater category and semantic diversity. Comprehensive evaluation shows that B2NERD significantly enhances LLMs’ Open NER capabilities. Our B2NER models, trained on B2NERD, outperform GPT-4 by 6.8-12.0 F1 points and surpass previous methods in 3 out-of-domain benchmarks across 15 datasets and 6 languages. The data, models, and code are publicly available at https://github.com/UmeanNever/B2NER.
ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
Junjie Ye | Guanyu Li | SongYang Gao | Caishuang Huang | Yilong Wu | Sixian Li | Xiaoran Fan | Shihan Dou | Tao Ji | Qi Zhang | Tao Gui | Xuanjing Huang
Proceedings of the 31st International Conference on Computational Linguistics
Junjie Ye | Guanyu Li | SongYang Gao | Caishuang Huang | Yilong Wu | Sixian Li | Xiaoran Fan | Shihan Dou | Tao Ji | Qi Zhang | Tao Gui | Xuanjing Huang
Proceedings of the 31st International Conference on Computational Linguistics
Existing evaluations of tool learning primarily focus on validating the alignment of selected tools for large language models (LLMs) with expected outcomes. However, these approaches rely on a limited set of scenarios where answers can be pre-determined. Furthermore, a sole emphasis on outcomes disregards the complex capabilities required for LLMs to effectively use tools. To tackle this issue, we propose ToolEyes, a fine-grained system tailored for the evaluation of the LLMs’ tool learning capabilities in authentic scenarios. The system meticulously examines seven real-world scenarios, analyzing five dimensions crucial to LLMs in tool learning: format alignment, intent comprehension, behavior planning, tool selection, and answer organization. Additionally, ToolEyes incorporates a tool library boasting approximately 600 tools, serving as an intermediary between LLMs and the physical world. Evaluations involving ten LLMs across three categories reveal a preference for specific scenarios and limited cognitive abilities in tool learning. Intriguingly, expanding the model size even exacerbates the hindrance to tool learning. The code and data are available at https://github.com/Junjie-Ye/ToolEyes.
2024
Improving Discriminative Capability of Reward Models in RLHF Using Contrastive Learning
Lu Chen | Rui Zheng | Binghai Wang | Senjie Jin | Caishuang Huang | Junjie Ye | Zhihao Zhang | Yuhao Zhou | Zhiheng Xi | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Lu Chen | Rui Zheng | Binghai Wang | Senjie Jin | Caishuang Huang | Junjie Ye | Zhihao Zhang | Yuhao Zhou | Zhiheng Xi | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Reinforcement Learning from Human Feedback (RLHF) is a crucial approach to aligning language models with human values and intentions. A fundamental challenge in this method lies in ensuring that the reward model accurately understands and evaluates human preferences. Current methods rely on ranking losses to teach the reward model to assess preferences, but they are susceptible to noise and ambiguous data, often failing to deeply understand human intentions. To address this issue, we introduce contrastive learning into the reward modeling process. In addition to supervised ranking loss, we introduce an unsupervised contrastive loss to enable the reward model to fully capture the distinctions in contrastive data. Experimental results demonstrate that the proposed contrastive learning-based reward modeling method effectively enhances the generalization of the reward model, stabilizes the reinforcement learning training process, and improves the final alignment with human preferences.
TransferTOD: A Generalizable Chinese Multi-Domain Task-Oriented Dialogue System with Transfer Capabilities
Ming Zhang | Caishuang Huang | Yilong Wu | Shichun Liu | Huiyuan Zheng | Yurui Dong | Yujiong Shen | Shihan Dou | Jun Zhao | Junjie Ye | Qi Zhang | Tao Gui | Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Ming Zhang | Caishuang Huang | Yilong Wu | Shichun Liu | Huiyuan Zheng | Yurui Dong | Yujiong Shen | Shihan Dou | Jun Zhao | Junjie Ye | Qi Zhang | Tao Gui | Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Task-oriented dialogue (TOD) systems aim to efficiently handle task-oriented conversations, including information collection. How to utilize TOD accurately, efficiently and effectively for information collection has always been a critical and challenging task. Recent studies have demonstrated that Large Language Models (LLMs) excel in dialogue, instruction generation, and reasoning, and can significantly enhance the performance of TOD through fine-tuning. However, current datasets primarily cater to user-led systems and are limited to predefined specific scenarios and slots, thereby necessitating improvements in the proactiveness, diversity, and capabilities of TOD. In this study, we present a detailed multi-domain task-oriented data construction process for conversations, and a Chinese dialogue dataset generated based on this process, **TransferTOD**, which authentically simulates human-computer dialogues in 30 popular life service scenarios. Leveraging this dataset, we trained a model using full-parameter fine-tuning called **TransferTOD-7B**, showcasing notable abilities in slot filling and questioning. Our work has demonstrated its strong generalization capabilities in various downstream scenarios, significantly enhancing both data utilization efficiency and system performance. The data is released in https://github.com/KongLongGeFDU/TransferTOD.
RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning
Junjie Ye | Yilong Wu | Songyang Gao | Caishuang Huang | Sixian Li | Guanyu Li | Xiaoran Fan | Qi Zhang | Tao Gui | Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Junjie Ye | Yilong Wu | Songyang Gao | Caishuang Huang | Sixian Li | Guanyu Li | Xiaoran Fan | Qi Zhang | Tao Gui | Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Tool learning has generated widespread interest as a vital means of interaction between Large Language Models (LLMs) and the physical world. Current research predominantly emphasizes LLMs’ capacity to utilize tools in well-structured environments while overlooking their stability when confronted with the inevitable noise of the real world. To bridge this gap, we introduce *RoTBench*, a multi-level benchmark for evaluating the robustness of LLMs in tool learning. Specifically, we establish five external environments, each featuring varying levels of noise (i.e., Clean, Slight, Medium, Heavy, and Union), providing an in-depth analysis of the model’s resilience across three critical phases: tool selection, parameter identification, and content filling. Experiments involving six widely-used models underscore the urgent necessity for enhancing the robustness of LLMs in tool learning. For instance, the performance of GPT-4 even drops significantly from 80.00 to 58.10 when there is no substantial change in manual accuracy. More surprisingly, the noise correction capability inherent in the GPT family paradoxically impedes its adaptability in the face of mild noise. In light of these findings, we propose RoTTuning, a strategy that enriches the diversity of training environments to bolster the robustness of LLMs in tool learning. The code and data are available at https://github.com/Junjie-Ye/RoTBench.
StepCoder: Improving Code Generation with Reinforcement Learning from Compiler Feedback
Shihan Dou | Yan Liu | Haoxiang Jia | Enyu Zhou | Limao Xiong | Junjie Shan | Caishuang Huang | Xiao Wang | Xiaoran Fan | Zhiheng Xi | Yuhao Zhou | Tao Ji | Rui Zheng | Qi Zhang | Tao Gui | Xuanjing Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Shihan Dou | Yan Liu | Haoxiang Jia | Enyu Zhou | Limao Xiong | Junjie Shan | Caishuang Huang | Xiao Wang | Xiaoran Fan | Zhiheng Xi | Yuhao Zhou | Tao Ji | Rui Zheng | Qi Zhang | Tao Gui | Xuanjing Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The advancement of large language models (LLMs) has significantly propelled the field of code generation. Previous work integrated reinforcement learning (RL) with compiler feedback for exploring the output space of LLMs to enhance code generation quality. However, the lengthy code generated by LLMs in response to complex human requirements makes RL exploration a challenge. Also, since the unit tests may not cover the complicated code, optimizing LLMs by using these unexecuted code snippets is ineffective. To tackle these challenges, we introduce StepCoder, a novel RL framework for code generation, consisting of two main components: CCCS addresses the exploration challenge by breaking the long sequences code generation task into a Curriculum of Code Completion Subtasks, while FGO only optimizes the model by masking the unexecuted code segments to provide Fine-Grained Optimization. In addition, we furthermore construct the APPS+ dataset for RL training, which is manually verified to ensure the correctness of unit tests. Experimental results show that our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks. The code and dataset will be made available upon publication.
ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages
Junjie Ye | Sixian Li | Guanyu Li | Caishuang Huang | Songyang Gao | Yilong Wu | Qi Zhang | Tao Gui | Xuanjing Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Junjie Ye | Sixian Li | Guanyu Li | Caishuang Huang | Songyang Gao | Yilong Wu | Qi Zhang | Tao Gui | Xuanjing Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Tool learning is widely acknowledged as a foundational approach or deploying large language models (LLMs) in real-world scenarios. While current research primarily emphasizes leveraging tools to augment LLMs, it frequently neglects emerging safety considerations tied to their application. To fill this gap, we present ToolSword, a comprehensive framework dedicated to meticulously investigating safety issues linked to LLMs in tool learning. Specifically, ToolSword delineates six safety scenarios for LLMs in tool learning, encompassing malicious queries and jailbreak attacks in the input stage, noisy misdirection and risky cues in the execution stage, and harmful feedback and error conflicts in the output stage. Experiments conducted on 11 open-source and closed-source LLMs reveal enduring safety challenges in tool learning, such as handling harmful queries, employing risky tools, and delivering detrimental feedback, which even GPT-4 is susceptible to. Moreover, we conduct further studies with the aim of fostering research on tool learning safety. The data will be released upon acceptance of the paper.
Search
Fix author
Co-authors
- Tao Gui 10
- Xuan-Jing Huang (黄萱菁) 10
- Junjie Ye (叶俊杰) 9
- Qi Zhang 7
- Yilong Wu 5
- Shihan Dou 4
- Xiaoran Fan 3
- Songyang Gao 3
- Guanyu Li 3
- Sixian Li 3
- Zhiheng Xi 3
- Qi Zhang 3
- Tao Ji 2
- Senjie Jin 2
- Xiao Wang 2
- Yuran Wang 2
- Ming Zhang 2
- Yunke Zhang 2
- Huiyuan Zheng 2
- Rui Zheng 2
- Enyu Zhou 2
- Meng Zhou 2
- Yuhao Zhou 2
- Mingxu Chai 1
- Lu Chen 1
- Zhuohan Chen 1
- Yurui Dong 1
- Xiku Du 1
- Jianping Fan 1
- Wenjie Fu 1
- Chenhao Huang 1
- Kaixin Huang 1
- Haoxiang Jia 1
- Shichun Liu 1
- Yan Liu 1
- Pu Lu 1
- Yang Nan 1
- Yang Qiao 1
- Xipeng Qiu (邱锡鹏) 1
- Junjie Shan 1
- Yujiong Shen 1
- Zhongchao Shi 1
- Binghai Wang 1
- Peng Wang 1
- Yuhui Wang 1
- Wuwenxi 1
- Limao Xiong 1
- Xueying Xu 1
- Chenyuan Yang 1
- Leyi Yang 1
- Xiaolong Yang 1
- Yuming Yang 1
- Guoqiang Zhang 1
- Jiazheng Zhang 1
- Rongyu Zhang 1
- Zhihao Zhang 1
- Jun Zhao 1
- Wantong Zhao 1
- Dingwei Zhu 1