2024
pdf
bib
abs
CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios
Zetian Ouyang
|
Yishuai Qiu
|
Linlin Wang
|
Gerard De Melo
|
Ya Zhang
|
Yanfeng Wang
|
Liang He
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
With the proliferation of Large Language Models (LLMs) in diverse domains, there is a particular need for unified evaluation standards in clinical medical scenarios, where models need to be examined very thoroughly. We present CliMedBench, a comprehensive benchmark with 14 expert-guided core clinical scenarios specifically designed to assess the medical ability of LLMs across 7 pivot dimensions. It comprises 33,735 questions derived from real-world medical reports of top-tier tertiary hospitals and authentic examination exercises. The reliability of this benchmark has been confirmed in several ways. Subsequent experiments with existing LLMs have led to the following findings: (i) Chinese medical LLMs underperform on this benchmark, especially where medical reasoning and factual consistency are vital, underscoring the need for advances in clinical knowledge and diagnostic accuracy. (ii) Several general-domain LLMs demonstrate substantial potential in medical clinics, while the limited input capacity of many medical LLMs hinders their practical use. These findings reveal both the strengths and limitations of LLMs in clinical scenarios and offer critical insights for medical research.
pdf
bib
abs
RaTEScore: A Metric for Radiology Report Generation
Weike Zhao
|
Chaoyi Wu
|
Xiaoman Zhang
|
Ya Zhang
|
Yanfeng Wang
|
Weidi Xie
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
This paper introduces a novel, entity-aware metric, termed as Radiological Report (Text) Evaluation (RaTEScore), to assess the quality of medical reports generated by AI models. RaTEScore emphasizes crucial medical entities such as diagnostic outcomes and anatomical details, and is robust against complex medical synonyms and sensitive to negation expressions. Technically, we developed a comprehensive medical NER dataset, RaTE-NER, and trained an NER model specifically for this purpose. This model enables the decomposition of complex radiological reports into constituent medical entities. The metric itself is derived by comparing the similarity of entity embeddings, obtained from a language model, based on their types and relevance to clinical significance. Our evaluations demonstrate that RaTEScore aligns more closely with human preference than existing metrics, validated both on established public benchmarks and our newly proposed RaTE-Eval benchmark.
pdf
bib
abs
DictLLM: Harnessing Key-Value Data Structures with Large Language Models for Enhanced Medical Diagnostics
YiQiu Guo
|
Yuchen Yang
|
Ya Zhang
|
Yu Wang
|
Yanfeng Wang
Findings of the Association for Computational Linguistics: ACL 2024
Structured data offers an efficient means of organizing information. Exsisting text-serialization based methods for processing structured data using large language models (LLMs) are not designed to explicitly capture the heterogeneity of structured data. Such methods are suboptimal for LLMs to process structured data, and may lead to large input token size and poor robustness to input perturbation. In this paper, we propose a novel framework called DictLLM, which is an efficient and effective framework for the modeling of medical lab report to deal with the report-assisted diagnosis generation task. DictLLM introduce 1) group positional encoding to maintain the permutation invariance, 2) hierarchical attention bias to capture the inductive bias of structured data, and 3) a optimal transport alignment layer to align the embeddings generated by the dict encoder with the LLM, producing a list of fixed-length virtual tokens. We conduct experiments with multiple LLM models on a large-scale real-world medical lab report dataset for automatic diagnosis generation. The results show that our proposed framework outperforms the baseline methods and few-shot GPT-4 in terms of both Rouge-L and Knowledge F1 score. We also conduct multiple experiments and analyze the scalability and robustness of our proposed framework, demonstrating the superiority of our method in modeling the heterogeneous structure of medical dictionaries data.
pdf
bib
abs
HSDreport: Heart Sound Diagnosis with Echocardiography Reports
Zihan Zhao
|
Pingjie Wang
|
Liudan Zhao
|
Yuchen Yang
|
Ya Zhang
|
Kun Sun
|
Xin Sun
|
Xin Zhou
|
Yu Wang
|
Yanfeng Wang
Findings of the Association for Computational Linguistics: EMNLP 2024
Heart sound auscultation holds significant importance in the diagnosis of congenital heart disease. However, existing methods for Heart Sound Diagnosis (HSD) tasks are predominantly limited to a few fixed categories, framing the HSD task as a rigid classification problem that does not fully align with medical practice and offers only limited information to physicians. Besides, such methods do not utilize echocardiography reports, the gold standard in the diagnosis of related diseases. To tackle this challenge, we introduce HSDreport, a new benchmark for HSD, which mandates the direct utilization of heart sounds obtained from auscultation to predict echocardiography reports. This benchmark aims to merge the convenience of auscultation with the comprehensive nature of echocardiography reports. First, we collect a new dataset for this benchmark, comprising 2,275 heart sound samples along with their corresponding reports. Subsequently, we develop a knowledge-aware query-based transformer to handle this task. The intent is to leverage the capabilities of medically pre-trained models and the internal knowledge of large language models (LLMs) to address the task’s inherent complexity and variability, thereby enhancing the robustness and scientific validity of the method. Furthermore, our experimental results indicate that our method significantly outperforms traditional HSD approaches and existing multimodal LLMs in detecting key abnormalities in heart sounds.
2008
pdf
bib
A Two-Stage Approach to Chinese Part-of-Speech Tagging
Aitao Chen
|
Ya Zhang
|
Gordon Sun
Proceedings of the Sixth SIGHAN Workshop on Chinese Language Processing