The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics

Christoph Leiter, Juri Opitz, Daniel Deutsch, Yang Gao, Rotem Dror, Steffen Eger


Abstract
Generative large language models (LLMs) have seen many breakthroughs over the last year. With an increasing number of parameters and pre-training data, they have shown remarkable capabilities to solve tasks with minimal or no task-related examples. Notably, LLMs have been successfully employed as evaluation metrics in text generation tasks. Approaches in this context differ in the choice of input prompts, the selection of samples for demonstration, and the methodology used to construct scores from the model output. Within this context, we introduce the Eval4NLP 2023 shared task, which asks participants to explore such approaches for machine translation evaluation and summarization evaluation. Specifically, we select a list of allowed LLMs and disallow fine-tuning to ensure a focus on prompting. We test the participants' approaches on a new reference-free test set spanning three language pairs for machine translation as well as a summarization dataset. Further, we present an overview of the approaches taken by the participants, report their results on the test set, and analyze paths for future work. Finally, as a separate track, we perform a human evaluation of the plausibility of the explanations given by the LLMs and their effect on model performance. We make parts of our code and datasets available.
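As a rough illustration of the kind of approach the task targets, the Python sketch below prompts an LLM for a reference-free translation quality score and parses a number from the output. The prompt wording, the 0-100 scale, and the llm_generate helper are illustrative assumptions, not the shared task's prescribed interface.

import re

def llm_generate(prompt: str) -> str:
    """Hypothetical stand-in for a query to one of the allowed LLMs."""
    raise NotImplementedError

# Illustrative prompt template; participants varied wording, demonstrations,
# and scales, and this is only one plausible choice.
PROMPT_TEMPLATE = (
    "Score how well the translation conveys the source sentence, "
    "from 0 (worst) to 100 (best). Respond with the score only.\n"
    "Source: {src}\n"
    "Translation: {hyp}\n"
    "Score:"
)

def score_translation(src: str, hyp: str) -> float:
    """Prompt the LLM and extract a numeric quality score from its reply."""
    output = llm_generate(PROMPT_TEMPLATE.format(src=src, hyp=hyp))
    match = re.search(r"\d+(?:\.\d+)?", output)
    # Fall back to the scale midpoint if the model returns no number.
    return float(match.group()) if match else 50.0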
Anthology ID:
2023.eval4nlp-1.10
Volume:
Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems
Month:
November
Year:
2023
Address:
Bali, Indonesia
Editors:
Daniel Deutsch, Rotem Dror, Steffen Eger, Yang Gao, Christoph Leiter, Juri Opitz, Andreas Rücklé
Venues:
Eval4NLP | WS
Publisher:
Association for Computational Linguistics
Pages:
117–138
URL:
https://aclanthology.org/2023.eval4nlp-1.10
DOI:
10.18653/v1/2023.eval4nlp-1.10
Cite (ACL):
Christoph Leiter, Juri Opitz, Daniel Deutsch, Yang Gao, Rotem Dror, and Steffen Eger. 2023. The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics. In Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems, pages 117–138, Bali, Indonesia. Association for Computational Linguistics.
Cite (Informal):
The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics (Leiter et al., Eval4NLP-WS 2023)
PDF:
https://aclanthology.org/2023.eval4nlp-1.10.pdf