The Good, the Bad and the Constructive: Automatically Measuring Peer Review’s Utility for Authors

Abdelrahman Sadallah; Tim Baumgärtner; Iryna Gurevych; Ted Briscoe

doi:10.18653/v1/2025.emnlp-main.1476

Abstract

Providing constructive feedback to paper authors is a core component of peer review. With reviewers increasingly having less time to perform reviews, automated support systems are required to ensure high reviewing quality, thus making the feedback in reviews useful for authors. To this end, we identify four key aspects of review comments (individual points in weakness sections of reviews) that drive the utility for authors: Actionability, Grounding & Specificity, Verifiability, and Helpfulness. To enable evaluation and development of models assessing review comments, we introduce the RevUtil dataset. We collect 1,430 human-labeled review comments and scale our data with 10k synthetically labeled comments for training purposes. The synthetic data additionally contains rationales, i.e., explanations for the aspect score of a review comment. Employing the RevUtil dataset, we benchmark fine-tuned models for assessing review comments on these aspects and generating rationales. Our experiments demonstrate that these fine-tuned models achieve agreement levels with humans comparable to, and in some cases exceeding, those of powerful closed models like GPT-4o. Our analysis further reveals that machine-generated reviews generally underperform human reviews on our four aspects.

Anthology ID: 2025.emnlp-main.1476
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 28991–29021
URL: https://aclanthology.org/2025.emnlp-main.1476/
DOI: 10.18653/v1/2025.emnlp-main.1476
Cite (ACL): Abdelrahman Sadallah, Tim Baumgärtner, Iryna Gurevych, and Ted Briscoe. 2025. The Good, the Bad and the Constructive: Automatically Measuring Peer Review’s Utility for Authors. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 28991–29021, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): The Good, the Bad and the Constructive: Automatically Measuring Peer Review’s Utility for Authors (Sadallah et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.1476.pdf
Checklist: 2025.emnlp-main.1476.checklist.pdf
