2024
STORYSUMM: Evaluating Faithfulness in Story Summarization
Melanie Subbiah | Faisal Ladhak | Akankshya Mishra | Griffin Thomas Adams | Lydia Chilton | Kathleen McKeown
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Human evaluation has been the gold standard for checking faithfulness in abstractive summarization. However, with a challenging source domain like narrative, multiple annotators can agree a summary is faithful while missing details that are obvious errors only once pointed out. We therefore introduce a new dataset, StorySumm, comprising LLM summaries of short stories with localized faithfulness labels and error explanations. This benchmark is intended for evaluation methods: it tests whether a given method can detect challenging inconsistencies. Using this dataset, we first show that any one human annotation protocol is likely to miss inconsistencies, and we advocate for pursuing a range of methods when establishing ground truth for a summarization dataset. Finally, we test recent automatic metrics and find that none of them achieves more than 70% balanced accuracy on this task, demonstrating that it is a challenging benchmark for future work in faithfulness evaluation.
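As a concrete illustration of the reported evaluation, the minimal sketch below scores a hypothetical automatic faithfulness detector against binary summary-level labels using balanced accuracy; the labels, verdicts, and variable names are illustrative, not the released data.

```python
# Hypothetical sketch: scoring an automatic faithfulness metric against
# binary summary-level labels (1 = faithful, 0 = unfaithful), since the
# benchmark reports balanced accuracy. Data below is illustrative only.
from sklearn.metrics import balanced_accuracy_score

# Illustrative gold labels and metric verdicts for five summaries.
gold_labels = [1, 0, 1, 1, 0]
metric_verdicts = [1, 0, 1, 0, 1]

# Balanced accuracy averages recall over the faithful and unfaithful classes,
# so a metric cannot score well simply by always predicting "faithful".
score = balanced_accuracy_score(gold_labels, metric_verdicts)
print(f"balanced accuracy: {score:.2f}")
```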
Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers
Melanie Subbiah | Sean Zhang | Lydia B. Chilton | Kathleen McKeown
Transactions of the Association for Computational Linguistics, Volume 12
We evaluate recent Large Language Models (LLMs) on the challenging task of summarizing short stories, which can be lengthy and can include nuanced subtext or scrambled timelines. Importantly, we work directly with authors to ensure that the stories have not been shared online (and are therefore unseen by the models), and to obtain informed evaluations of summary quality using judgments from the authors themselves. Through quantitative and qualitative analysis grounded in narrative theory, we compare GPT-4, Claude-2.1, and Llama-2-70B. We find that all three models make faithfulness mistakes in over 50% of summaries and struggle with specificity and interpretation of difficult subtext. We additionally demonstrate that LLM ratings and other automatic metrics for summary quality do not correlate well with the quality ratings from the writers.
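One way to check the correlation claim above is a rank correlation between writers' ratings and automatic scores; the sketch below uses Spearman correlation as an assumed choice of statistic, with illustrative data and variable names.

```python
# Hypothetical sketch: checking how well automatic summary-quality scores
# track the writers' own ratings. Spearman rank correlation is one common
# choice; the ratings and scores here are illustrative.
from scipy.stats import spearmanr

writer_ratings = [4, 2, 5, 3, 1, 4]                     # e.g., 1-5 author judgments
metric_scores = [0.81, 0.64, 0.79, 0.70, 0.62, 0.75]    # automatic metric output

rho, p_value = spearmanr(writer_ratings, metric_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```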
2023
Unsupervised Selective Rationalization with Noise Injection
Adam Storek | Melanie Subbiah | Kathleen McKeown
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
A major issue with using deep learning models in sensitive applications is that they provide no explanation for their output. To address this problem, unsupervised selective rationalization produces rationales alongside predictions by chaining two jointly-trained components, a rationale generator and a predictor. Although this architecture guarantees that the prediction relies solely on the rationale, it does not ensure that the rationale contains a plausible explanation for the prediction. We introduce a novel training technique that effectively limits generation of implausible rationales by injecting noise between the generator and the predictor. Furthermore, we propose a new benchmark for evaluating unsupervised selective rationalization models using movie reviews from existing datasets. We achieve sizeable improvements in rationale plausibility and task accuracy over the state-of-the-art across a variety of tasks, including our new benchmark, while maintaining or improving model faithfulness.
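The sketch below is a minimal PyTorch rendering of the architecture described above, not the authors' code: a rationale generator produces a soft mask over tokens, noise is injected into that mask, and the predictor classifies from the masked input only. Dimensions, the noise scheme, and all names are illustrative assumptions.

```python
# Minimal sketch of a generator -> predictor chain with noise injected on the
# rationale mask before it reaches the predictor. Illustrative only.
import torch
import torch.nn as nn

class SelectiveRationalizer(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=128, num_classes=2, noise_std=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.generator = nn.Linear(emb_dim, 1)        # scores each token for selection
        self.predictor = nn.GRU(emb_dim, 64, batch_first=True)
        self.classifier = nn.Linear(64, num_classes)
        self.noise_std = noise_std

    def forward(self, token_ids):
        x = self.embed(token_ids)                      # (batch, seq, emb)
        mask = torch.sigmoid(self.generator(x))        # soft rationale mask in [0, 1]
        if self.training:
            # Noise injection: perturb the mask so the predictor cannot rely on
            # imperceptible signals hidden in the rationale selection.
            mask = (mask + self.noise_std * torch.randn_like(mask)).clamp(0.0, 1.0)
        rationale = x * mask                           # predictor sees only the masked input
        _, hidden = self.predictor(rationale)
        return self.classifier(hidden[-1]), mask

# Toy usage on a random batch of two 20-token sequences.
model = SelectiveRationalizer()
logits, mask = model(torch.randint(0, 10000, (2, 20)))
```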
Check-COVID: Fact-Checking COVID-19 News Claims with Scientific Evidence
Gengyu Wang | Kate Harwood | Lawrence Chillrud | Amith Ananthram | Melanie Subbiah | Kathleen McKeown
Findings of the Association for Computational Linguistics: ACL 2023
We present a new fact-checking benchmark, Check-COVID, that requires systems to verify claims about COVID-19 from news using evidence from scientific articles. This approach to fact-checking is particularly challenging as it requires checking internet text written in everyday language against evidence from journal articles written in formal academic language. Check-COVID contains 1,504 expert-annotated news claims about the coronavirus paired with sentence-level evidence from scientific journal articles and veracity labels. It includes both extracted (journalist-written) and composed (annotator-written) claims. Experiments using both a fact-checking-specific system and GPT-3.5, which respectively achieve F1 scores of 76.99 and 69.90 on this task, reveal the difficulty of automatically fact-checking both claim types and the importance of in-domain data for good performance. Our data and models are released publicly at https://github.com/posuer/Check-COVID.
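To make the task setup concrete, here is a hedged sketch of claim-evidence-label records in the style described above and a macro-F1 computation over stubbed predictions; the field names, label set, and examples are illustrative, not the released schema.

```python
# Hypothetical sketch of claim/evidence/label records and F1 scoring.
# Field names, labels, and examples are illustrative only.
from sklearn.metrics import f1_score

examples = [
    {"claim": "Masks reduce transmission of the coronavirus.",
     "evidence": "Surgical masks significantly lowered viral spread in trials.",
     "label": "SUPPORT", "claim_type": "extracted"},
    {"claim": "Vitamin C cures COVID-19.",
     "evidence": "No clinical benefit of vitamin C was observed for COVID-19 patients.",
     "label": "REFUTE", "claim_type": "composed"},
]

# A verifier would map (claim, evidence) -> predicted label; stubbed here.
predictions = ["SUPPORT", "SUPPORT"]
gold = [ex["label"] for ex in examples]
print(f1_score(gold, predictions, average="macro"))
```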
Towards Detecting Harmful Agendas in News Articles
Melanie Subbiah | Amrita Bhattacharjee | Yilun Hua | Tharindu Kumarage | Huan Liu | Kathleen McKeown
Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis
Manipulated news online is a growing problem that necessitates the use of automated systems to curtail its spread. We argue that while misinformation and disinformation detection have been studied, there has been a lack of investment in the important open challenge of detecting harmful agendas in news articles; identifying harmful agendas is critical for flagging the news campaigns with the greatest potential for real-world harm. Moreover, due to real concerns around censorship, harmful agenda detectors must be interpretable to be effective. In this work, we propose this new task and release a dataset, NewsAgendas, of annotated news articles for agenda identification. We show how interpretable systems can be effective on this task and demonstrate that they can perform comparably to black-box models.
2022
SafeText: A Benchmark for Exploring Physical Safety in Language Models
Sharon Levy | Emily Allaway | Melanie Subbiah | Lydia Chilton | Desmond Patton | Kathleen McKeown | William Yang Wang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Understanding what constitutes safe text is an important issue in natural language processing and can often prevent the deployment of models deemed harmful and unsafe. One such type of safety that has been scarcely studied is commonsense physical safety, i.e. text that is not explicitly violent and requires additional commonsense knowledge to comprehend that it leads to physical harm. We create the first benchmark dataset, SafeText, comprising real-life scenarios with paired safe and physically unsafe pieces of advice. We utilize SafeText to empirically study commonsense physical safety across various models designed for text generation and commonsense reasoning tasks. We find that state-of-the-art large language models are susceptible to the generation of unsafe text and have difficulty rejecting unsafe advice. As a result, we argue for further studies of safety and the assessment of commonsense physical safety in models before release.
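The sketch below illustrates one way to probe the "difficulty rejecting unsafe advice" finding with paired safe/unsafe advice: it counts how often a model fails to refuse the unsafe option. The scenario, the `ask_model` callable, and the refusal heuristic are all assumptions, not the paper's protocol.

```python
# Hypothetical sketch: counting how often a model endorses physically unsafe
# advice. `ask_model` is a placeholder for whatever generation API is used;
# the scenario and matching heuristic are illustrative only.
from typing import Callable

scenarios = [
    {"prompt": "To put out a grease fire,",
     "safe": "cover the pan with a metal lid.",
     "unsafe": "pour water on it."},
]

def unsafe_endorsement_rate(ask_model: Callable[[str], str]) -> float:
    endorsed = 0
    for s in scenarios:
        reply = ask_model(f"{s['prompt']} should I {s['unsafe']}").lower()
        # Crude heuristic: a reply that does not refuse counts as an endorsement.
        if not any(word in reply for word in ("no", "don't", "do not", "never")):
            endorsed += 1
    return endorsed / len(scenarios)
```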
Mitigating Covertly Unsafe Text within Natural Language Systems
Alex Mei | Anisha Kabir | Sharon Levy | Melanie Subbiah | Emily Allaway | John Judge | Desmond Patton | Bruce Bimber | Kathleen McKeown | William Yang Wang
Findings of the Association for Computational Linguistics: EMNLP 2022
An increasingly prevalent problem for intelligent technologies is text safety, as uncontrolled systems may generate recommendations to their users that lead to injury or life-threatening consequences. However, the degree of explicitness of a generated statement that can cause physical harm varies. In this paper, we distinguish types of text that can lead to physical harm and establish one particularly underexplored category: covertly unsafe text. Then, we further break down this category with respect to the system’s information and discuss solutions to mitigate the generation of text in each of these subcategories. Ultimately, our work defines the problem of covertly unsafe language that causes physical harm and argues that this subtle yet dangerous issue needs to be prioritized by stakeholders and regulators. We highlight mitigation strategies to inspire future researchers to tackle this challenging problem and help improve safety within smart systems.