Ketan Todi

2023

Understanding Large Language Model Based Metrics for Text Summarization
Abhishek Pradhan | Ketan Todi
Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems

This paper compares the two most widely used techniques for evaluating generative tasks with large language models (LLMs), prompt-based evaluation and log-likelihood evaluation, as part of the Eval4NLP shared task. We focus on the summarization task and evaluate both small and large LLMs. We also study the impact of LLAMA and LLAMA 2 on summarization, using the same set of prompts and techniques. We use the Eval4NLP dataset for our comparison. The study provides evidence that prompt-based evaluation has advantages over log-likelihood-based evaluation, especially for large models and models with stronger reasoning ability.

Co-authors

Abhishek Pradhan (1)

Venues

eval4nlp (1)
ws (1)