Wentao Zhang
Papers on this page may belong to the following people: Wentao Zhang, Wentao Zhang
2026
The Data Frontier for Large Language Models: Selection, Synthesis, and Tools
Lijun Wu | Wentao Zhang | Conghui He
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 5: Tutorial Abstracts)
As the development of Large Language Models (LLMs) matures, the focus of the research community is undergoing a critical shift from a purely model-centric to a data-centric paradigm. It is now evident that the quality, diversity, and composition of training data—not merely its scale—are the primary drivers of a model’s advanced capabilities, from complex reasoning to reliable instruction following. However, acquiring and curating such high-quality data remains a significant bottleneck. This tutorial provides a comprehensive and practical guide to the state-of-the-art in data research directions for LLMs. We structure the tutorial around the two core pillars of modern data strategy: intelligent data selection and advanced data synthesis. In the first part, we delve into methods for curating the most valuable information from vast, noisy datasets, covering techniques like LLM-as-a-judge for automated quality filtering and active learning for maximizing annotation efficiency. The second part explores the synthetic data revolution, detailing paradigms that range from generating complex reasoning traces (e.g., Chain-of-Thought) to deploying sophisticated multi-agent workflows that can autonomously create high-quality, diverse instruction data from raw seeds. Finally, we will conclude with a practical overview of open-source tools and platforms that facilitate these data-centric workflows, empowering researchers and practitioners to build better models through better data. Attendees will leave with a principled framework and actionable insights for designing and implementing the advanced data strategies required to build the next generation of powerful, specialized, and aligned LLMs.
Trust Within? Seek Beyond? Knowledge Boundary Aware Policy Optimization for Agentic Search
Tao Feng | Xinke Jiang | Xinyan Hu | Yonggang Zhang | Zhen Tao | Wentao Zhang | Boyang Liu | Wenhao Jiang | Chao Wu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Agentic search augments large language models (LLMs) with external knowledge through reinforcement learning. However, existing approaches suffer from blind reliance on noisy retrieval and hallucination when both parametric and external knowledge fail—reflecting a lack of calibration regarding the model’s knowledge boundary. We propose Knowledge boundary Policy Optimization (KbPO), a reinforcement learning framework that explicitly aligns retrieval decisions with quantified knowledge states. KbPO introduces: (1) a semantic stability metric to delineate reliable parametric knowledge; (2) a four-quadrant taxonomy synthesising internal certainty with retrieval quality; and (3) a quadrant-based reward mechanism incentivising calibrated behaviour. We further adopt an iterative query evolution pipeline to construct boundary-probing training samples. Experiments on ten benchmarks demonstrate that KbPO outperforms strong baselines while exhibiting reduced hallucination rates.
2025
From Chat Logs to Collective Insights: Aggregative Question Answering
Wentao Zhang | Woojeong Kim | Yuntian Deng
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Conversational agents powered by large language models (LLMs) are rapidly becoming integral to our daily interactions, generating unprecedented amounts of conversational data. Such datasets offer a powerful lens into societal interests, trending topics, and collective concerns. Yet existing approaches typically treat these interactions as independent, missing critical insights that could emerge from aggregating and reasoning across large-scale conversation logs. In this paper, we introduce Aggregative Question Answering, a novel task requiring models to reason explicitly over thousands of user-chatbot interactions to answer aggregational queries, such as identifying emerging concerns among specific demographics. To enable research in this direction, we construct a benchmark, WildChat-AQA, comprising 6,027 aggregative questions derived from 182,330 real-world chatbot conversations. Experiments show that existing methods either struggle to reason effectively or incur prohibitive computational costs, underscoring the need for new approaches capable of extracting collective insights from large-scale conversational data.
Interactive Training: Feedback-Driven Neural Network Optimization
Wentao Zhang | Yang Young Lu | Yuntian Deng
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Traditional neural network training typically follows fixed, predefined optimization recipes, lacking the flexibility to dynamically respond to instabilities or emerging training issues. In this paper, we introduce Interactive Training, an open-source framework that enables real-time, feedback-driven intervention during neural network training by human experts or automated AI agents. At its core, Interactive Training uses a control server to mediate communication between users or agents and the ongoing training process, allowing users to dynamically adjust optimizer hyperparameters, training data, and model checkpoints. Through three case studies, we demonstrate that Interactive Training achieves superior training stability, reduced sensitivity to initial hyperparameters, and improved adaptability to evolving user needs, paving the way toward a future training paradigm where AI agents autonomously monitor training logs, proactively resolve instabilities, and optimize training dynamics.
TC–RAG: Turing–Complete RAG’s Case study on Medical LLM Systems
Xinke Jiang | Yue Fang | Rihong Qiu | Haoyu Zhang | Yongxin Xu | Hao Chen | Wentao Zhang | Ruizhe Zhang | Yuchen Fang | Xinyu Ma | Xu Chu | Junfeng Zhao | Yasha Wang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
In the pursuit of enhancing domain-specific Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) emerges as a promising solution to mitigate issues such as hallucinations, outdated knowledge, and limited expertise in highly specialized queries. However, existing approaches to RAG fall short by neglecting system state variables, which are crucial for ensuring adaptive control, retrieval halting, and system convergence. In this paper, we introduce Turing-Complete-RAG (TC-RAG), a novel framework supported by rigorous proof that addresses these challenges by incorporating a Turing Complete System to manage state variables, thereby enabling more efficient and accurate knowledge retrieval. By leveraging a memory stack system with adaptive retrieval, reasoning, and planning capabilities, TC-RAG not only ensures the controlled halting of retrieval processes but also mitigates the accumulation of erroneous knowledge via Push and Pop actions. In case studies spanning the medical and general domains, our extensive experiments on seven real-world healthcare and general-domain datasets show that TC-RAG outperforms existing methods in accuracy by over 7.20%. Our code, datasets, and RAG resources are available at https://github.com/Artessay/TC-RAG.
2024
ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training
Le Zhuo | Zewen Chi | Minghao Xu | Heyan Huang | Jianan Zhao | Heqi Zheng | Conghui He | Xian-Ling Mao | Wentao Zhang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We propose ProtLLM, a versatile cross-modal large language model (LLM) for both protein-centric and protein-language tasks. ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs where the natural language text is interspersed with an arbitrary number of proteins. Besides, we propose the protein-as-word language modeling approach to train ProtLLM. By developing a specialized protein vocabulary, we equip the model with the capability to predict not just natural language but also proteins from a vast pool of candidates. Additionally, we construct a large-scale interleaved protein-text dataset, named InterPT, for pre-training. This dataset comprehensively encompasses both (1) structured data sources like protein annotations and (2) unstructured data sources like biological research papers, thereby endowing ProtLLM with crucial knowledge for understanding proteins. We evaluate ProtLLM on classic supervised protein-centric tasks and explore its novel protein-language applications. Experimental results demonstrate that ProtLLM not only achieves superior performance against protein-specialized baselines on protein-centric tasks but also induces zero-shot and in-context learning capabilities on protein-language tasks.
2023
Patton: Language Model Pretraining on Text-Rich Networks
Bowen Jin | Wentao Zhang | Yu Zhang | Yu Meng | Xinyang Zhang | Qi Zhu | Jiawei Han
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
A real-world text corpus sometimes comprises not only text documents, but also semantic links between them (e.g., academic papers in a bibliographic network are linked by citations and co-authorships). Text documents and semantic connections form a text-rich network, which empowers a wide range of downstream tasks such as classification and retrieval. However, pretraining methods for such structures are still lacking, making it difficult to build one generic model that can be adapted to various tasks on text-rich networks. Current pretraining objectives, such as masked language modeling, purely model texts and do not take inter-document structure information into consideration. To this end, we propose our PretrAining on TexT-Rich NetwOrk framework Patton. Patton includes two pretraining strategies: network-contextualized masked language modeling and masked node prediction, to capture the inherent dependency between textual attributes and network structure. We conduct experiments on four downstream tasks in five datasets from both academic and e-commerce domains, where Patton outperforms baselines significantly and consistently.
Co-authors
- Yuntian Deng 2
- Conghui He 2
- Xinke Jiang 2
- Hao Chen 1
- Zewen Chi 1
- Xu Chu 1
- Yuchen Fang 1
- Yue Fang 1
- Tao Feng 1
- Jiawei Han 1
- Xinyan Hu 1
- He-Yan Huang (黄河燕) 1
- Wenhao Jiang 1
- Bowen Jin 1
- Woojeong Kim 1
- Boyang Liu 1
- Yang Young Lu 1
- Xinyu Ma 1
- Xian-Ling Mao 1
- Yu Meng 1
- Rihong Qiu 1
- Zhen Tao 1
- Yasha Wang 1
- Chao Wu 1
- Lijun Wu 1
- Minghao Xu 1
- Yongxin Xu 1
- Haoyu Zhang 1
- Ruizhe Zhang 1
- Xinyang Zhang 1
- Yonggang Zhang 1
- Yu Zhang 1
- Jianan Zhao 1
- Junfeng Zhao 1
- Heqi Zheng 1
- Qi Zhu 1
- Le Zhuo 1