Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs

Xue-Yong Fu, Md Tahmid Rahman Laskar, Cheng Chen, Shashi Bhushan Tn


Abstract
In recent years, large language models (LLMs) have drawn significant attention due to their impressive emergent capabilities that were not observed in earlier language models. One increasingly common application is using LLMs as evaluators of text generated by other generative models. In this paper, we explore whether LLMs are reliable in assessing the factual consistency of summaries produced by text generation models. We first propose a new approach to computing a factuality score with LLMs, using a single LLM to perform every step of a question-answering-based factuality scoring pipeline. We then study how well various LLMs can score factuality directly. Our evaluation is conducted on traditional benchmarks by measuring correlation with human annotations. Contrary to expectations, we find that none of the factuality metrics shows any significant correlation (i.e., a coefficient greater than 0.3) with human evaluations of factuality for GPT-4, PaLM-2, and Claude-2, the only exception being GPT-3.5 in two subcategories of factuality. Our findings are consistent across almost all factual error types, suggesting a fundamental limitation in the ability of current LLMs to assess factuality.
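The question-answering-based pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `call_llm`, the prompt wording, and the yes/no agreement check are all assumptions introduced here to show how a single LLM could handle every step (question generation, answering against each text, and answer comparison).

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an actual LLM API call.

    Replace with a real client (e.g., a chat-completion request)
    before use; it is a placeholder here.
    """
    raise NotImplementedError

def qa_factuality_score(source: str, summary: str, llm=call_llm) -> float:
    """Score summary factuality with one LLM for all pipeline steps."""
    # Step 1: the LLM generates questions from the summary,
    # one question per line (an assumed output format).
    questions = [
        q for q in llm(
            f"Generate one question per line about this summary:\n{summary}"
        ).splitlines() if q.strip()
    ]
    matches = 0
    for q in questions:
        # Step 2: the same LLM answers each question twice,
        # once grounded in the summary and once in the source.
        a_sum = llm(f"Answer using only this text:\n{summary}\nQ: {q}")
        a_src = llm(f"Answer using only this text:\n{source}\nQ: {q}")
        # Step 3: the same LLM judges whether the answers agree.
        verdict = llm(
            f"Do these answers agree? Reply yes or no.\nA1: {a_sum}\nA2: {a_src}"
        )
        matches += verdict.strip().lower().startswith("yes")
    # Fraction of summary-derived questions the source supports.
    return matches / max(len(questions), 1)
```

A score of 1.0 would mean every summary-derived answer was judged consistent with the source; the paper's finding is that scores produced this way correlate poorly with human factuality judgments.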
Anthology ID:
2023.gem-1.25
Volume:
Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
Month:
December
Year:
2023
Address:
Singapore
Editors:
Sebastian Gehrmann, Alex Wang, João Sedoc, Elizabeth Clark, Kaustubh Dhole, Khyathi Raghavi Chandu, Enrico Santus, Hooman Sedghamiz
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
310–316
URL:
https://aclanthology.org/2023.gem-1.25
Cite (ACL):
Xue-Yong Fu, Md Tahmid Rahman Laskar, Cheng Chen, and Shashi Bhushan Tn. 2023. Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs. In Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 310–316, Singapore. Association for Computational Linguistics.
Cite (Informal):
Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs (Fu et al., GEM-WS 2023)
PDF:
https://aclanthology.org/2023.gem-1.25.pdf