2025
LLMs Can Simulate Standardized Patients via Agent Coevolution
Zhuoyun Du | LujieZheng LujieZheng | Renjun Hu | Yuyang Xu | Xiawei Li | Ying Sun | Wei Chen | Jian Wu | Haolei Cai | Haochao Ying
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Training medical personnel using standardized patients (SPs) remains a complex challenge, requiring extensive domain expertise and role-specific practice. Most research on Large Language Model (LLM)-based simulated patients focuses on improving data retrieval accuracy or adjusting prompts through human feedback. However, this focus has overlooked the critical need for patient agents to learn a standardized presentation pattern that transforms data into human-like patient responses through unsupervised simulations. To address this gap, we propose EvoPatient, a novel simulated patient framework in which a patient agent and doctor agents simulate the diagnostic process through multi-turn dialogues, simultaneously gathering experience to improve the quality of both questions and answers, ultimately enabling human doctor training. Extensive experiments on various cases demonstrate that, by providing only overall SP requirements, our framework improves over existing reasoning methods by more than 10% in requirement alignment and achieves better human preference, while achieving an optimal balance of resource consumption after evolving over 200 cases for 10 hours, with excellent generalizability. Our system will be available at https://github.com/ZJUMAI/EvoPatient.
AutoEvolve: Automatically Evolving Queries for Applicable and Scalable Retrieval-Augmented Generation Benchmarking
Ding-Chu Zhang | Xiaowen Zhang | Yue Fei | Renjun Hu | Xiao-Wen Yang | Zhi Zhou | Baixuan Li | Yu-Feng Li | Xing Shi | Wei Lin
Findings of the Association for Computational Linguistics: EMNLP 2025
Retrieval-augmented generation (RAG) enables large language models (LLMs) to address queries beyond their internal knowledge by integrating domain knowledge from specialized corpora, which necessitates generating benchmarks on specific corpora to evaluate RAG systems. However, existing automated generation methods exhibit Weak Applicability and Weak Scalability. Weak Applicability refers to the reliance on metadata from specific corpora for query generation, which constrains applicability to other corpora. Weak Scalability refers to query content being fixed after generation, so that difficulty cannot be increased dynamically, which limits the scalability of the queries. To overcome these issues, we propose AutoEvolve, an applicable approach for dynamically evolving queries to construct scalable RAG benchmarks. Our approach is grounded in three key innovations: (i) a corpus-agnostic method for constructing a universal entity-document graph; (ii) a suite of evolution operations designed to dynamically update queries; and (iii) a difficulty-guided metric that directs the query evolution process. Through experiments on three generated benchmarks, we demonstrate that AutoEvolve evolves queries that are significantly more challenging, paving the way for more applicable and scalable RAG evaluations.
Summarize-Exemplify-Reflect: Data-driven Insight Distillation Empowers LLMs for Few-shot Tabular Classification
Yifei Yuan | Jiatong Li | Weijia Zhang | Mohammad Aliannejadi | Evangelos Kanoulas | Renjun Hu
Findings of the Association for Computational Linguistics: EMNLP 2025
Recent studies show the promise of large language models (LLMs) for few-shot tabular classification but highlight challenges due to the variability in structured data. To address this, we propose distilling data into actionable insights to enable robust and effective classification by LLMs. Drawing inspiration from human learning processes, we introduce InsightTab, an insight distillation framework guided by principles of divide-and-conquer, easy-first, and reflective learning. Our approach integrates rule summarization, strategic exemplification, and insight reflection through deep collaboration between LLMs and data modeling techniques. The obtained insights enable LLMs to better align their general knowledge and capabilities with the particular requirements of specific tabular tasks. We extensively evaluate InsightTab on nine datasets. The results demonstrate consistent improvement over state-of-the-art methods. Ablation studies further validate the principle-guided distillation process, while analyses emphasize InsightTab’s effectiveness in leveraging labeled data and managing bias.
2024
CO3: Low-resource Contrastive Co-training for Generative Conversational Query Rewrite
Yifei Yuan | Chen Shi | Wang Runze | Liyi Chen | Renjun Hu | Zengming Zhang | Feijun Jiang | Wai Lam
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Generative query rewrite produces reconstructed queries from the conversation history but relies heavily on gold rewrite pairs that are expensive to obtain. Recently, few-shot learning has been gaining popularity for this task, but these methods are sensitive to the inherent noise caused by limited data size. Moreover, both approaches suffer performance degradation when there is a language style shift between training and testing cases. To this end, we study low-resource generative conversational query rewrite that is robust to both noise and language style shift. The core idea is to utilize massive unlabeled data to make further improvements via a contrastive co-training paradigm. Specifically, we co-train two dual models (namely Rewriter and Simplifier) such that each of them provides extra guidance through pseudo-labeling for enhancing the other in an iterative manner. We also leverage contrastive learning with data augmentation, which enables our model to pay more attention to the truly valuable information than to the noise. Extensive experiments demonstrate the superiority of our model under both few-shot and zero-shot scenarios. We also verify the better generalization ability of our model when encountering language style shift.