Quantifying the Influence of Evaluation Aspects on Long-Form Response Assessment

Go Kamoda, Akari Asai, Ana Brassard, Keisuke Sakaguchi


Abstract
Evaluating the outputs of large language models (LLMs) on long-form generative tasks remains challenging. While fine-grained, aspect-wise evaluations provide valuable diagnostic information, they are difficult to design exhaustively, and each aspect’s contribution to the overall acceptability of an answer is unclear. In this study, we propose a method to compute an overall quality score as a weighted average of three key aspects: factuality, informativeness, and formality. This approach achieves stronger correlations with human judgments than previous metrics. Our analysis identifies factuality as the most predictive aspect of overall quality. Additionally, we release a dataset of 1.2k long-form QA answers annotated with both absolute judgments and relative preferences in overall and aspect-wise schemes to aid future research in evaluation practices.
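The abstract describes the overall score as a weighted average of factuality, informativeness, and formality, fit to correlate with human judgments. Below is a minimal, hypothetical sketch of that idea, not the authors' implementation; the aspect scores, human ratings, and the use of non-negative least squares for weight fitting are illustrative assumptions.

```python
# Sketch (hypothetical data, not the paper's code): fit non-negative weights for a
# weighted average of aspect scores against human overall judgments.
import numpy as np
from scipy.optimize import nnls
from scipy.stats import pearsonr

# Hypothetical aspect scores per answer: columns = factuality, informativeness, formality.
aspect_scores = np.array([
    [0.9, 0.7, 0.8],
    [0.4, 0.6, 0.9],
    [0.7, 0.5, 0.6],
    [0.2, 0.8, 0.7],
])
# Hypothetical human overall ratings for the same answers.
human_overall = np.array([0.85, 0.50, 0.60, 0.40])

# Fit non-negative weights, then normalize them to sum to 1 (a weighted average).
weights, _ = nnls(aspect_scores, human_overall)
weights = weights / weights.sum()

# Predicted overall score and its correlation with human judgments.
overall_pred = aspect_scores @ weights
r, _ = pearsonr(overall_pred, human_overall)
print("weights (factuality, informativeness, formality):", weights.round(3))
print(f"Pearson correlation with human judgments: {r:.3f}")
```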
Anthology ID:
2025.coling-main.588
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
8787–8808
URL:
https://aclanthology.org/2025.coling-main.588/
Cite (ACL):
Go Kamoda, Akari Asai, Ana Brassard, and Keisuke Sakaguchi. 2025. Quantifying the Influence of Evaluation Aspects on Long-Form Response Assessment. In Proceedings of the 31st International Conference on Computational Linguistics, pages 8787–8808, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Quantifying the Influence of Evaluation Aspects on Long-Form Response Assessment (Kamoda et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.588.pdf