SummEdits: Measuring LLM Ability at Factual Reasoning Through The Lens of Summarization

Philippe Laban; Wojciech Kryściński; Divyansh Agarwal; Alexander Richard Fabbri; Caiming Xiong; Shafiq Joty; Chien-Sheng Wu

doi:10.18653/v1/2023.emnlp-main.600

Abstract

With the recent appearance of LLMs in practical settings, having methods that can effectively detect factual inconsistencies is crucial to reduce the propagation of misinformation and improve trust in model outputs. When testing on existing factual consistency benchmarks, we find that a few large language models (LLMs) perform competitively on classification benchmarks for factual inconsistency detection compared to traditional non-LLM methods. However, a closer analysis reveals issues with existing evaluation benchmarks, affecting evaluation precision. To address this, we propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits. This new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible, as we estimate inter-annotator agreement at about 0.9. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance, highlighting the gaps in LLMs’ ability to reason about facts and detect inconsistencies when they occur.

Anthology ID: 2023.emnlp-main.600
Volume: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 9662–9676
URL: https://aclanthology.org/2023.emnlp-main.600/
DOI: 10.18653/v1/2023.emnlp-main.600
Cite (ACL): Philippe Laban, Wojciech Kryscinski, Divyansh Agarwal, Alexander Fabbri, Caiming Xiong, Shafiq Joty, and Chien-Sheng Wu. 2023. SummEdits: Measuring LLM Ability at Factual Reasoning Through The Lens of Summarization. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9662–9676, Singapore. Association for Computational Linguistics.
Cite (Informal): SummEdits: Measuring LLM Ability at Factual Reasoning Through The Lens of Summarization (Laban et al., EMNLP 2023)
PDF: https://aclanthology.org/2023.emnlp-main.600.pdf
Video: https://aclanthology.org/2023.emnlp-main.600.mp4
