ReproHum #0033-05: Human Evaluation of Factuality from A Multidisciplinary Perspective

Andra-Maria Florescu; Marius Micluța-Câmpeanu; Ștefana Arina Tăbușcă; Liviu P. Dinu

Abstract

This paper is a joint contribution to the 2025 ReproNLP shared task, part of the ReproHum project. We focused on reproducing the human evaluation of a single criterion, namely the factuality of Scientific Automated Generated Systems from August et al. (2022). In accordance with the ReproHum guidelines, we followed the original study as closely as possible, with two human raters who each coded 300 ratings. In addition, we conducted a further study on two domain-specific subsets of the dataset (medicine and physics), for which we employed expert annotators. Our reproduction of the factuality assessment found similar overall rates of factual inaccuracies across models. However, variability and weak agreement with the original model rankings suggest that results are difficult to reproduce reliably, especially when the original results are close.

Anthology ID: 2025.gem-1.53
Volume: Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Month: July
Year: 2025
Address: Vienna, Austria and virtual meeting
Editors: Ofir Arviv, Miruna Clinciu, Kaustubh Dhole, Rotem Dror, Sebastian Gehrmann, Eliya Habba, Itay Itzhak, Simon Mille, Yotam Perlitz, Enrico Santus, João Sedoc, Michal Shmueli Scheuer, Gabriel Stanovsky, Oyvind Tafjord
Venues: GEM | WS
Publisher: Association for Computational Linguistics
Pages: 583–589
URL: https://aclanthology.org/2025.gem-1.53/
Cite (ACL): Andra-Maria Florescu, Marius Micluța-Câmpeanu, Stefana Arina Tabusca, and Liviu P Dinu. 2025. ReproHum #0033-05: Human Evaluation of Factuality from A Multidisciplinary Perspective. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), pages 583–589, Vienna, Austria and virtual meeting. Association for Computational Linguistics.
Cite (Informal): ReproHum #0033-05: Human Evaluation of Factuality from A Multidisciplinary Perspective (Florescu et al., GEM 2025)
PDF: https://aclanthology.org/2025.gem-1.53.pdf
