Understanding Large Language Model Based Metrics for Text Summarization

Abhishek Pradhan, Ketan Todi


Abstract
This paper compares the two most widely used techniques for evaluating generative tasks with large language models (LLMs), prompt-based evaluation and log-likelihood evaluation, as part of the Eval4NLP shared task. We focus on the summarization task and evaluate both small and large LLMs. We also study the impact of LLaMA and Llama 2 on summarization, using the same set of prompts and techniques, and we use the Eval4NLP dataset for our comparison. This study provides evidence of the advantages of prompt-based evaluation over log-likelihood-based evaluation, especially for larger models and models with stronger reasoning ability.
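To make the two evaluation styles named in the abstract concrete, here is a minimal, hypothetical Python sketch (not the authors' code) using the Hugging Face transformers API: log-likelihood evaluation scores a candidate summary by the average token log-probability the model assigns to it given the source document, while prompt-based evaluation asks the model directly for a quality rating. The model name, prompt wording, and rating scale are illustrative assumptions.

```python
# Hypothetical sketch of the two LLM-based evaluation styles compared in the
# paper; the model choice and prompts are illustrative, not the authors' setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumption: any causal LM works here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def log_likelihood_score(source: str, summary: str) -> float:
    """Score = mean log-probability of the summary tokens given the source."""
    ctx = tok(f"Document: {source}\nSummary:", return_tensors="pt").input_ids
    full = tok(f"Document: {source}\nSummary: {summary}",
               return_tensors="pt").input_ids
    labels = full.clone()
    labels[:, : ctx.shape[1]] = -100  # mask the context; score only the summary
    with torch.no_grad():
        out = model(full, labels=labels)
    return -out.loss.item()  # higher (less negative) = more likely summary

def prompt_based_score(source: str, summary: str) -> str:
    """Ask the model directly for a 1-5 quality rating (prompt-based)."""
    prompt = (
        f"Document: {source}\nSummary: {summary}\n"
        "Rate the summary's quality from 1 (poor) to 5 (excellent). Rating:"
    )
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        gen = model.generate(ids, max_new_tokens=4)
    return tok.decode(gen[0, ids.shape[1]:], skip_special_tokens=True).strip()
```

In this framing, the paper's reported finding is that the prompt-based rating tracks summary quality better than the log-likelihood score, particularly for larger and more capable models.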
Anthology ID: 2023.eval4nlp-1.12
Volume: Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems
Month: November
Year: 2023
Address: Bali, Indonesia
Editors: Daniel Deutsch, Rotem Dror, Steffen Eger, Yang Gao, Christoph Leiter, Juri Opitz, Andreas Rücklé
Venues: Eval4NLP | WS
Publisher: Association for Computational Linguistics
Pages: 149–155
URL: https://aclanthology.org/2023.eval4nlp-1.12
DOI: 10.18653/v1/2023.eval4nlp-1.12
Cite (ACL): Abhishek Pradhan and Ketan Todi. 2023. Understanding Large Language Model Based Metrics for Text Summarization. In Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems, pages 149–155, Bali, Indonesia. Association for Computational Linguistics.
Cite (Informal): Understanding Large Language Model Based Metrics for Text Summarization (Pradhan & Todi, Eval4NLP-WS 2023)
PDF: https://aclanthology.org/2023.eval4nlp-1.12.pdf