Understanding Faithfulness and Reasoning of Large Language Models on Plain Biomedical Summaries

Biaoyan Fang, Xiang Dai, Sarvnaz Karimi


Abstract
Generating plain biomedical summaries with Large Language Models (LLMs) can enhance the accessibility of biomedical knowledge to the public. However, how faithful the generated summaries are remains an open yet critical question. To address this, we propose FaReBio, a benchmark dataset with expert-annotated Faithfulness and Reasoning on plain Biomedical Summaries. This dataset consists of 175 plain summaries (1,445 sentences) generated by seven different LLMs, paired with source articles. Using our dataset, we identify the performance gap of LLMs in generating faithful plain biomedical summaries and observe a negative correlation between abstractiveness and faithfulness. We also show that current faithfulness evaluation metrics do not work well in the biomedical domain and confirm the over-confident tendency of LLMs as faithfulness evaluators. To better understand the faithfulness judgements, we further benchmark LLMs in retrieving supporting evidence and show the gap of LLMs in reasoning about faithfulness evaluation at different abstractiveness levels. Going beyond binary faithfulness labels, and coupled with the annotation of supporting sentences, our dataset can further contribute to the understanding of faithfulness evaluation and reasoning.
Anthology ID:
2024.findings-emnlp.578
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
9890–9911
URL:
https://aclanthology.org/2024.findings-emnlp.578
Cite (ACL):
Biaoyan Fang, Xiang Dai, and Sarvnaz Karimi. 2024. Understanding Faithfulness and Reasoning of Large Language Models on Plain Biomedical Summaries. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9890–9911, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Understanding Faithfulness and Reasoning of Large Language Models on Plain Biomedical Summaries (Fang et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.578.pdf