A Closer Look into Using Large Language Models for Automatic Evaluation

Cheng-Han Chiang, Hung-yi Lee


Abstract
Using large language models (LLMs) to evaluate text quality has recently gained popularity. Several prior works explore the idea of using LLMs for evaluation, but they differ in certain details of the evaluation process. In this paper, we analyze *LLM evaluation* and *G-Eval*, and we discuss how those details of the evaluation process change how well the ratings given by LLMs correlate with human ratings. We find that the auto Chain-of-Thought (CoT) used in G-Eval does not always make G-Eval more aligned with human ratings. We also show that forcing the LLM to output only a numeric rating, as in G-Eval, is suboptimal. Finally, we reveal that asking the LLM to explain its own ratings consistently improves the correlation between ChatGPT's ratings and human ratings, yielding state-of-the-art (SoTA) correlations on two meta-evaluation datasets.
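The protocol studied in the paper can be sketched as follows: an LLM is prompted to rate a text along a quality dimension, and the resulting scores are meta-evaluated by their correlation with human ratings. The sketch below is illustrative only and is not the authors' released code; the prompt wording, the query_llm helper, and the parsing logic are hypothetical placeholders, while the correlation measures (Spearman's rho and Kendall's tau) are the standard ones used in such meta-evaluations.

    import re
    from scipy.stats import spearmanr, kendalltau

    # Two illustrative prompt styles contrasted in the paper:
    #   "score only"   -- force the LLM to output just a numeric rating (as in G-Eval)
    #   "rate-explain" -- ask the LLM to also explain its rating
    SCORE_ONLY = (
        "Rate the coherence of the following summary on a scale of 1 to 5. "
        "Output only the number.\n\nSummary: {summary}"
    )
    RATE_EXPLAIN = (
        "Rate the coherence of the following summary on a scale of 1 to 5, "
        "then explain your rating.\n\nSummary: {summary}"
    )

    def query_llm(prompt: str) -> str:
        """Hypothetical helper: send `prompt` to an LLM (e.g., ChatGPT) and return its reply."""
        raise NotImplementedError("Plug in your own LLM API call here.")

    def parse_rating(reply: str) -> float:
        """Naively extract the first number in the reply as the rating."""
        match = re.search(r"\d+(\.\d+)?", reply)
        return float(match.group()) if match else float("nan")

    def meta_evaluate(summaries, human_scores, prompt_template):
        """Correlate LLM ratings with human ratings over a set of samples."""
        llm_scores = [
            parse_rating(query_llm(prompt_template.format(summary=s)))
            for s in summaries
        ]
        rho, _ = spearmanr(llm_scores, human_scores)
        tau, _ = kendalltau(llm_scores, human_scores)
        return rho, tau

In this sketch, comparing meta_evaluate(..., SCORE_ONLY) against meta_evaluate(..., RATE_EXPLAIN) mirrors the paper's contrast between numeric-only output and rating-with-explanation prompts.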
Anthology ID:
2023.findings-emnlp.599
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
8928–8942
URL:
https://aclanthology.org/2023.findings-emnlp.599
DOI:
10.18653/v1/2023.findings-emnlp.599
Cite (ACL):
Cheng-Han Chiang and Hung-yi Lee. 2023. A Closer Look into Using Large Language Models for Automatic Evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8928–8942, Singapore. Association for Computational Linguistics.
Cite (Informal):
A Closer Look into Using Large Language Models for Automatic Evaluation (Chiang & Lee, Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.599.pdf