Jie Song
2026
ToolDNA: Autonomous Evolution of Tool Metadata for Robust Dialogue Agents
Qiuyuan Ai | Cong Wang | Jiaqi Zhang | Zengxin Han | Jie Song
Findings of the Association for Computational Linguistics: ACL 2026
Task-oriented dialogue (TOD) systems are vital for facilitating complex, goal-directed interactions across sectors like customer support and online retail. However, they face persistent limitations: labor-intensive manual metadata tuning and sparse reinforcement learning (RL) rewards that fail to diagnose invocation errors. To address this, we propose ToolDNA, a dynamic adaptation framework enabling autonomous co-evolution of policy networks and tool metadata via RL, anchored by two synergistic loops. An RL loop optimizes policies by generating rollout trajectories (reasoning, actions, descriptive updates) from user inputs, with multi-dimensional rewards refining invocations. A tool metadata loop, coordinated by a dedicated Tool Manager, evolves metadata through policy-generated candidates during rollouts and Feedback LLM-derived refinements from historical data. These mutually reinforcing loops close traditional reward gaps, forming a closed-loop trial-error-reflection cycle for self-improvement. Extensive experiments on a real-world dataset of 3,100 customer service dialogues confirm ToolDNA’s superiority, with notable gains over baselines: it achieves +11% problem resolution and +54% accuracy over commercial LLMs with prompt engineering; +25%/+35% over supervised fine-tuning; and +15%/+15% over a traditional RL baseline. Linguistic analysis corroborates that evolved metadata retains semantic intent while enhancing parseability. Case studies in two typical contexts, i.e., car inventory search and loan calculation, further validate its ability to resolve critical ambiguities. ToolDNA pioneers scalable self-improvement for robust, deployable tool-augmented agents with minimal human oversight. We release our code to facilitate future research.
Evolutionary Negative Module Pruning for Better LoRA Merging
Anda Cao | Zhuo Gou | Yi Wang | Kaixuan Chen | Yu Wang | Can Wang | Mingli Song | Jie Song
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Merging multiple Low-Rank Adaptation (LoRA) experts into a single backbone is a promising approach for efficient multi-task deployment. While existing methods strive to alleviate interference via weight interpolation or subspace alignment, they rest upon the implicit assumption that all LoRA matrices contribute constructively to the merged model. In this paper, we uncover a critical bottleneck in current merging paradigms: the existence of negative modules—specific LoRA layers that inherently degrade global performance upon merging. We propose Evolutionary Negative Module Pruning (ENMP), a plug-and-play LoRA pruning method to locate and exclude these detrimental modules prior to merging. By leveraging an evolutionary search strategy, ENMP effectively navigates the discrete, non-differentiable landscape of module selection to identify optimal pruning configurations. Extensive evaluations demonstrate that ENMP consistently boosts the performance of existing merging algorithms, achieving a new state-of-the-art across both language and vision domains. Code is available at https://github.com/CaoAnda/ENMP-LoRAMerging.
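The idea of searching a discrete keep/prune configuration with an evolutionary strategy can be illustrated with a toy sketch. This is not the authors' ENMP implementation (see the linked repository for that): the function `evolve_mask`, the per-module `CONTRIB` values, and `toy_fitness` are all hypothetical, and in practice the fitness would be a validation score of the merged model rather than a simple sum.

```python
import random

def evolve_mask(fitness, n_modules, pop_size=20, generations=30,
                mut_rate=0.1, seed=0):
    """Toy evolutionary search over binary masks (1 = keep, 0 = prune)."""
    rng = random.Random(seed)
    # Seed the population with the "keep everything" mask plus random masks.
    population = [[1] * n_modules] + [
        [rng.randint(0, 1) for _ in range(n_modules)]
        for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        # Elitism: the better half survives unchanged as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_modules)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation with probability mut_rate per module.
            child = [bit ^ 1 if rng.random() < mut_rate else bit
                     for bit in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

# Hypothetical per-module contributions; negative values play the role
# of "negative modules" that hurt the merged model.
CONTRIB = [0.5, -0.3, 0.2, -0.1, 0.4, 0.1, -0.6, 0.3]

def toy_fitness(mask):
    # Assumed stand-in for a merged-model validation score.
    return sum(c for c, keep in zip(CONTRIB, mask) if keep)

best = evolve_mask(toy_fitness, len(CONTRIB))
```

Because the all-keep mask is in the initial population and the better half always survives, the returned mask never scores below the unpruned baseline under this toy fitness.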
Cognitive Scaffold: From Fluid Context to Crystallized Memory for Long-Horizon DeepResearch Agents
Qiuyuan Ai | Zenghuang Fu | Zhaoyang Li | Ping Jiang | Haoyu Wu | Jie Song | Guannan He
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Scaling LLM-based agents to long-horizon deep research is constrained by the context-noise trade-off, where linear history accumulation degrades reasoning and dilutes fine-grained evidence. To address this, we introduce the Cognitive Scaffold, a factorized memory architecture that decouples the cognitive state into a Fluid Working Context for immediate reasoning and a persistent Knowledge Graph for long-term retention. Unlike unstructured summarization, our framework employs a Rejection Sampling Fine-Tuning (RFT) pipeline to crystallize saturated context into structured event snapshots, strictly enforcing atomic constraints to preserve numerical values and entities. During reasoning, a thought-driven dual-path retrieval mechanism enables the agent to proactively recover precise evidence. Empirical evaluations on Xbench-DeepSearch, BrowseComp-ZH, and GAIA demonstrate that Cognitive Scaffold consistently outperforms baselines, achieving 74.7% Avg@3 and 87.0% Pass@3 on Xbench-DeepSearch, 48.5% Avg@3 and 65.9% Pass@3 on BrowseComp-ZH, and 72.8% Avg@3 and 88.3% Pass@3 on GAIA, while reducing compression hallucinations to 5.3%. We open-source our codebase to facilitate future research.
2024
Unveiling the Lexical Sensitivity of LLMs: Combinatorial Optimization for Prompt Enhancement
Pengwei Zhan | Zhen Xu | Qian Tan | Jie Song | Ru Xie
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) demonstrate exceptional instruction-following ability to complete various downstream tasks. Although this impressive ability makes LLMs flexible task solvers, their performance in solving tasks also heavily relies on instructions. In this paper, we reveal that LLMs are over-sensitive to lexical variations in task instructions, even when the variations are imperceptible to humans. By providing models with neighborhood instructions, which are closely situated in the latent representation space and differ by only one semantically similar word, the performance on downstream tasks can be vastly different. Following this property, we propose a black-box Combinatorial Optimization framework for Prompt Lexical Enhancement (COPLE). COPLE performs iterative lexical optimization according to the feedback from a batch of proxy tasks, using a search strategy related to word influence. Experiments show that even widely-used human-crafted prompts for current benchmarks suffer from the lexical sensitivity of models, and COPLE recovers the degraded model ability in both instruction following and downstream task solving.
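A black-box lexical search of this kind can be sketched as a greedy loop that swaps one word at a time and keeps a swap only when a feedback score improves. The sketch below is an illustration under stated assumptions, not COPLE itself: `optimize_prompt`, the `CANDIDATES` table, and `toy_score` are hypothetical, the score stands in for accuracy on a batch of proxy tasks, and positions are visited left to right rather than in an order derived from word influence.

```python
def optimize_prompt(prompt, candidates, score):
    """Greedy word-level substitution guided by a black-box score."""
    words = prompt.split()
    best = score(" ".join(words))
    improved = True
    while improved:  # repeat passes until no single swap helps
        improved = False
        for i in range(len(words)):
            # Try every candidate replacement for the word at position i.
            for alt in candidates.get(words[i], []):
                trial = words[:i] + [alt] + words[i + 1:]
                s = score(" ".join(trial))
                if s > best:  # accept only strict improvements
                    words, best, improved = trial, s, True
    return " ".join(words), best

# Hypothetical table of semantically similar replacements.
CANDIDATES = {
    "Tell": ["Classify", "Determine"],
    "text": ["review", "passage"],
}

# Hypothetical stand-in for proxy-task feedback: a fixed per-word bonus.
WEIGHTS = {"Classify": 2, "Determine": 1, "review": 1}

def toy_score(prompt):
    return sum(WEIGHTS.get(w, 0) for w in prompt.split())

new_prompt, new_score = optimize_prompt(
    "Tell the sentiment of the text", CANDIDATES, toy_score
)
print(new_prompt, new_score)  # Classify the sentiment of the review 3
```

Accepting only strict improvements guarantees termination, since the score rises with every accepted swap and the set of reachable prompts is finite.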