Xuan Zhou


2026

Group bias in Large Language Models (LLMs) is a well-documented issue, its impact in high-stakes domains such as personalized nutritional advice remains under explored. This study introduces the USChainMains dataset to systematically evaluate LLMs, prompting them with names associated with specific racial and gender groups and rigorously quantifying the healthfulness of the generated meal recommendations against established dietary standards. The findings demonstrate that LLMs systematically recommend meals with significantly higher levels of adverse nutrients for names associated with Black, Hispanic, or male individuals, thereby reflecting and potentially reinforcing detrimental dietary stereotypes. Furthermore, our analysis of two common mitigation strategies reveals their limitations. While model scaling improves overall recommendation healthfulness, it is insufficient to eliminate the healthfulness gap between demographic groups. Notably, while augmented reasoning was effective in mitigating gender bias, it did not mitigate racial disparities. This work underscores the necessity of developing more nuanced, group-aware debiasing techniques to ensure AI-driven systems advance, rather than hinder, health equity.

2024

The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the close-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and clinical tasks that are complex but common in real-world practice, e.g., open-ended decision-making, long document processing, and emerging drug analysis. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs

2021

Although pre-trained big models (e.g., BERT, ERNIE, XLNet, GPT3 etc.) have delivered top performance in Seq2seq modeling, their deployments in real-world applications are often hindered by the excessive computations and memory demand involved. For many applications, including named entity recognition (NER), matching the state-of-the-art result under budget has attracted considerable attention. Drawing power from the recent advance in knowledge distillation (KD), this work presents a novel distillation scheme to efficiently transfer the knowledge learned from big models to their more affordable counterpart. Our solution highlights the construction of surrogate labels through the k-best Viterbi algorithm to distill knowledge from the teacher model. To maximally assimilate knowledge into the student model, we propose a multi-grained distillation scheme, which integrates cross entropy involved in conditional random field (CRF) and fuzzy learning. To validate the effectiveness of our proposal, we conducted a comprehensive evaluation on five NER benchmarks, reporting cross-the-board performance gains relative to competing prior-arts. We further discuss ablation results to dissect our gains.