Jonathan Bragg


2024

ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews
Mike D’Arcy | Alexis Ross | Erin Bransom | Bailey Kuehl | Jonathan Bragg | Tom Hope | Doug Downey
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce the task of automatically revising scientific papers based on peer feedback and release ARIES, a dataset of review comments and their corresponding paper edits. The data is drawn from real reviewer-author interactions from computer science, and we provide labels linking each reviewer comment to the specific paper edits made by the author in response. We automatically create a high-precision silver training set, as well as an expert-labeled test set that shows high inter-annotator agreement. In experiments with 10 models covering the state of the art, we find that they struggle even to identify which edits correspond to a comment—especially when the relationship between the edit and the comment is indirect and requires reasoning to uncover. We also extensively analyze GPT-4’s ability to generate edits given a comment and the original paper. We find that it often succeeds on a superficial level, but tends to rigidly follow the wording of the feedback rather than the underlying intent, and lacks technical details compared to human-written edits.

2022

GENIE: Toward Reproducible and Standardized Human Evaluation for Text Generation
Daniel Khashabi | Gabriel Stanovsky | Jonathan Bragg | Nicholas Lourie | Jungo Kasai | Yejin Choi | Noah A. Smith | Daniel Weld
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

While often assumed a gold standard, effective human evaluation of text generation remains an important, open area for research. We revisit this problem with a focus on producing consistent evaluations that are reproducible over time and across different populations. We study this goal in different stages of the human evaluation pipeline. In particular, we consider design choices for the annotation interface used to elicit human judgments and their impact on reproducibility. Furthermore, we develop an automated mechanism for maintaining annotator quality via a probabilistic model that detects and excludes noisy annotators. Putting these lessons together, we introduce GENIE: a system for running standardized human evaluations across different generation tasks. We instantiate GENIE with datasets representing four core challenges in text generation: machine translation, summarization, commonsense reasoning, and machine comprehension. For each task, GENIE offers a leaderboard that automatically crowdsources annotations for submissions, evaluating them along axes such as correctness, conciseness, and fluency. We have made the GENIE leaderboards publicly available, and have already ranked 50 submissions from 10 different research groups. We hope GENIE encourages further progress toward effective, standardized evaluations for text generation.

2016

Effective Crowd Annotation for Relation Extraction
Angli Liu | Stephen Soderland | Jonathan Bragg | Christopher H. Lin | Xiao Ling | Daniel S. Weld
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies