Identifying Reliable Evaluation Metrics for Scientific Text Revision

Léane Jourdan; Nicolas Hernandez; Florian Boudin; Richard Dufour

doi:10.18653/v1/2025.acl-long.335

Abstract

Evaluating text revision in scientific writing remains a challenge, as traditional metrics such as ROUGE and BERTScore primarily focus on similarity rather than capturing meaningful improvements. In this work, we analyse and identify the limitations of these metrics and explore alternative evaluation methods that better align with human judgments. We first conduct a manual annotation study to assess the quality of different revisions. Then, we investigate reference-free evaluation metrics from related NLP domains. Additionally, we examine LLM-as-a-judge approaches, analysing their ability to assess revisions with and without a gold reference. Our results show that LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. We find that a hybrid approach combining LLM-as-a-judge evaluation and task-specific metrics offers the most reliable assessment of revision.

Anthology ID: 2025.acl-long.335
Original: 2025.acl-long.335v1
Version 2: 2025.acl-long.335v2
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 6731–6756
URL: https://aclanthology.org/2025.acl-long.335/
DOI: 10.18653/v1/2025.acl-long.335
Cite (ACL): Leane Jourdan, Nicolas Hernandez, Florian Boudin, and Richard Dufour. 2025. Identifying Reliable Evaluation Metrics for Scientific Text Revision. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6731–6756, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Identifying Reliable Evaluation Metrics for Scientific Text Revision (Jourdan et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.335.pdf
