Jilong Kuang
2026
Compact Language Models with Iterative Text Refinement for Health Dialogue Summarization
Kellen Tan Cheng | Ganesh Ramesh | Nafiul Rashid | Geoffrey Jay Tso | Jilong Kuang
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Health wellness agents typically rely on large language models (LLMs) for response generation, where contextual information from a user's conversation history can be used for response grounding and personalization. High-quality conversation summaries are one such form of context: they reduce the number of input tokens during response generation, decreasing overhead and inference latency. However, directly deploying LLMs for this task is infeasible due to the scale of the task, the compute overhead, and health data compliance regulations. Furthermore, ground truth for real-world datasets is scarce due to privacy concerns and the high cost of health expert annotators. These factors necessitate the development of small, potentially on-device, language models capable of health dialogue summarization, particularly in the absence of ground truth labels. In this paper, we first present a comprehensive empirical study that benchmarks a variety of state-of-the-art smaller language models to better understand their baseline capabilities. Second, we present an unsupervised method that uses the summaries from multiple models, refined with iterative feedback, to generate high-quality summaries of health dialogues. Experiments illustrate that our method outperforms baselines on both open-source and proprietary benchmarks. Notably, our method can run viably on local compute without a GPU, using just a single MacBook with 16 GB of memory.
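The multi-model, feedback-driven loop the abstract describes could be sketched roughly as below. This is a toy illustration only, not the paper's method: the `critique` and `refine` functions are hypothetical rule-based stand-ins for model-generated feedback, and `models` is assumed to be a list of callables wrapping small summarization models.

```python
# Illustrative sketch: pick the best candidate summary from several small
# models, then refine it with iterative feedback. All names and logic here
# are assumptions for illustration, not the paper's actual pipeline.

def critique(summary, dialogue):
    # Toy feedback: flag dialogue topics missing from the summary.
    return [w for w in ("sleep", "steps") if w in dialogue and w not in summary]

def refine(summary, feedback):
    # Toy refinement: append a note for each missing topic.
    return summary + " " + " ".join(f"Mentions {w}." for w in feedback)

def summarize_with_refinement(dialogue, models, rounds=2):
    # 1) Each small model proposes a candidate summary.
    candidates = [m(dialogue) for m in models]
    # 2) Keep the candidate drawing the least critical feedback.
    best = min(candidates, key=lambda s: len(critique(s, dialogue)))
    # 3) Iteratively refine with feedback until clean or budget exhausted.
    for _ in range(rounds):
        feedback = critique(best, dialogue)
        if not feedback:
            break
        best = refine(best, feedback)
    return best
```

In a real system, critique and refinement would themselves be model calls; the control flow above is the part the abstract's "refined with iterative feedback" suggests.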
2025
Enhancing LLM-as-a-Judge through Active-Sampling-based Prompt Optimization
Cheng Zhen | Ervine Zheng | Jilong Kuang | Geoffrey Jay Tso
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
We introduce an active-sampling-based framework for automatic prompt optimization, designed to enhance the performance of Large Language Model (LLM)-as-a-judge systems (systems that use LLMs to evaluate the quality of text or other outputs) in label-scarce settings. Unlike existing approaches that rely on extensive annotations, our method starts with no labeled data and iteratively selects and labels a small, diverse, and informative subset of samples to guide prompt refinement. At each iteration, our method evaluates the current prompt on the selected data and automatically updates the prompt, enabling efficient prompt optimization with minimal supervision. Moreover, we formulate sample selection as a convex optimization problem that balances uncertainty and diversity, maximizing the utility of limited labeling budgets. We validate our framework across four popular LLMs and three real-world datasets, including one from a deployed industry product. Results show that our optimized prompts consistently outperform baselines, achieving significant gains in evaluation quality and robustness while substantially reducing labeling costs.
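The uncertainty-versus-diversity trade-off in the sample selection step can be illustrated with a simple greedy approximation. Note this is an assumed stand-in, not the paper's convex program: the scoring rule (uncertainty minus a penalty for similarity to already-selected samples) and the parameter `lam` are illustrative choices.

```python
# Greedy illustration of uncertainty/diversity-balanced sample selection.
# The actual paper formulates this as a convex optimization problem; this
# sketch only conveys the trade-off, with assumed inputs and scoring.

def select_samples(uncertainty, similarity, budget, lam=1.0):
    """Pick `budget` indices, each maximizing its uncertainty minus
    lam times its max similarity to anything already chosen."""
    chosen = []
    remaining = set(range(len(uncertainty)))
    while remaining and len(chosen) < budget:
        def score(i):
            redundancy = max((similarity[i][j] for j in chosen), default=0.0)
            return uncertainty[i] - lam * redundancy
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

With `lam=0` this degenerates to picking the most uncertain samples; a positive `lam` forces the labeling budget to spread across dissimilar samples.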
Data-Efficient Automatic Prompt Optimization for Memory-Enhanced Conversational Agents
Ervine Zheng | Yikuan Li | Geoffrey Jay Tso | Jilong Kuang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Automatic prompt optimization (APO) uses algorithms to automatically refine prompts for LLMs, effectively reducing human effort in prompt engineering. However, applying APO to memory-enhanced conversational agents presents unique challenges. These agents leverage memory to retain information from historical interactions with users and provide context-aware and personalized responses. Optimizing prompts for these agents is challenging due to their complex, interconnected modules that include memory writing, reading, and response generation. This paper introduces a data-efficient framework for APO in these agents. Our approach leverages LLMs to holistically optimize the prompts of all agents. We also introduce an automated evaluation module that not only provides a holistic quality score for responses but also performs error attribution, pinpointing failures within the specific modules. More importantly, to ensure the evaluation module aligns with human judgment, we develop a data-efficient active sampling algorithm with convex optimization to select the most informative samples for human feedback and prompt improvement. We conducted experiments on two health-related conversation datasets to demonstrate the effectiveness of the proposed framework.
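The error-attribution idea in the evaluation module can be sketched with a toy rule-based "judge" over the three module stages the abstract names. The module names, trace format, and scoring below are all illustrative assumptions; the paper's evaluator is LLM-based, not rule-based.

```python
# Hypothetical sketch of holistic scoring with error attribution across a
# memory-enhanced agent's modules (memory writing, reading, generation).
# The trace format and fixed scores are assumptions for illustration.

def evaluate(trace):
    """Return (quality_score, blamed_module) for one conversation turn.
    `trace` maps each module name to that module's output."""
    if trace["memory_write"] is None:
        return 0.0, "memory_write"         # fact was never stored
    if trace["memory_write"] not in trace["memory_read"]:
        return 0.3, "memory_read"          # stored fact not retrieved
    if trace["memory_write"] not in trace["response_generation"]:
        return 0.6, "response_generation"  # retrieved fact not used
    return 1.0, None                       # response grounded in memory
```

Attributing a low score to a specific module, rather than only scoring the final response, is what lets the optimizer target the right module's prompt.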