Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends

Sanjana Ramprasad, Elisa Ferracane, Zachary Lipton


Abstract
Recent advancements in large language models (LLMs) have considerably improved the capabilities of summarization systems. However, they continue to face a persistent challenge: hallucination. While prior work has extensively examined LLMs in the news domain, evaluation of dialogue summarization has primarily focused on BART-based models, leaving a notable gap in understanding LLM effectiveness. Our work addresses this gap by benchmarking LLMs for dialogue summarization faithfulness using human annotations, focusing on identifying and categorizing span-level inconsistencies. Specifically, we evaluate two prominent LLMs: GPT-4 and Alpaca-13B. Our evaluation reveals that LLMs often generate plausible but not fully supported inferences based on conversational contextual cues, a trait absent in older models. As a result, we propose a refined taxonomy of errors, introducing a novel category termed “Contextual Inference” to capture this aspect of LLM behavior. Using our taxonomy, we compare the behavioral differences between LLMs and older fine-tuned models. Additionally, we systematically assess the efficacy of automatic error detection methods on LLM summaries and find that they struggle to detect these nuanced errors effectively. To address this, we introduce two prompt-based approaches for fine-grained error detection that outperform existing metrics, particularly in identifying the novel “Contextual Inference” error type.
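The abstract describes the prompt-based detectors only at a high level. Below is a minimal sketch of how such a span-level checker could be wired up; the prompt wording, the category list (only “Contextual Inference” comes from the abstract, the rest are placeholders), and the `call_llm` helper are assumptions for illustration, not the authors' released prompts or code.

```python
# Illustrative sketch of a prompt-based, span-level error detector for
# dialogue summaries. Prompt text, categories, and `call_llm` are assumed.

import json
from typing import Callable, Dict, List

# Hypothetical taxonomy labels; "Contextual Inference" follows the paper's
# proposed category, the remaining labels are placeholders.
ERROR_CATEGORIES = ["Contextual Inference", "Wrong Reference", "Fabricated Fact"]

PROMPT_TEMPLATE = """You are checking a dialogue summary for factual consistency.

Dialogue:
{dialogue}

Summary sentence:
{sentence}

List every span in the sentence that is not fully supported by the dialogue.
For each span, assign one category from: {categories}.
Answer with a JSON list of objects with keys "span" and "category".
If the sentence is fully supported, answer with an empty JSON list."""


def detect_errors(dialogue: str,
                  summary_sentences: List[str],
                  call_llm: Callable[[str], str]) -> List[Dict]:
    """Run the span-level check once per summary sentence and collect results."""
    findings = []
    for sentence in summary_sentences:
        prompt = PROMPT_TEMPLATE.format(
            dialogue=dialogue,
            sentence=sentence,
            categories=", ".join(ERROR_CATEGORIES),
        )
        raw = call_llm(prompt)  # user-supplied LLM call (e.g., GPT-4)
        try:
            spans = json.loads(raw)
        except json.JSONDecodeError:
            spans = []  # treat malformed output as "no parsable errors"
        findings.append({"sentence": sentence, "errors": spans})
    return findings
```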
Anthology ID:
2024.acl-long.677
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
12549–12561
URL:
https://aclanthology.org/2024.acl-long.677
DOI:
10.18653/v1/2024.acl-long.677
Cite (ACL):
Sanjana Ramprasad, Elisa Ferracane, and Zachary Lipton. 2024. Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12549–12561, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends (Ramprasad et al., ACL 2024)
PDF:
https://aclanthology.org/2024.acl-long.677.pdf