BibTeX
@inproceedings{chaudhary-etal-2024-towards,
title = "Towards Understanding the Robustness of {LLM}-based Evaluations under Perturbations",
author = "Chaudhary, Manav and
Gupta, Harshit and
Bhat, Savita and
Varma, Vasudeva",
editor = "Lalitha Devi, Sobha and
Arora, Karunesh",
booktitle = "Proceedings of the 21st International Conference on Natural Language Processing (ICON)",
month = dec,
year = "2024",
address = "AU-KBC Research Centre, Chennai, India",
publisher = "NLP Association of India (NLPAI)",
url = "https://aclanthology.org/2024.icon-1.22/",
pages = "197--205",
abstract = "Traditional evaluation metrics like BLEU and ROUGE fall short when capturing the nuanced qualities of generated text, particularly when there is no single ground truth. In this paper, we explore the potential of Large Language Models (LLMs), specifically Google Gemini 1, to serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks. We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality annotators when compared with human judgments on the SummEval and USR datasets, asking the model to generate both a score as well as a justification for the score. Furthermore, we explore the robustness of the LLM evaluator by using perturbed inputs. Our findings suggest that while LLMs show promise, their alignment with human evaluators is limited, they are not robust against perturbations and significant improvements are required for their standalone use as reliable evaluators for subjective metrics."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="chaudhary-etal-2024-towards">
<titleInfo>
<title>Towards Understanding the Robustness of LLM-based Evaluations under Perturbations</title>
</titleInfo>
<name type="personal">
<namePart type="given">Manav</namePart>
<namePart type="family">Chaudhary</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Harshit</namePart>
<namePart type="family">Gupta</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Savita</namePart>
<namePart type="family">Bhat</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Vasudeva</namePart>
<namePart type="family">Varma</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2024-12</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 21st International Conference on Natural Language Processing (ICON)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Sobha</namePart>
<namePart type="family">Lalitha Devi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Karunesh</namePart>
<namePart type="family">Arora</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>NLP Association of India (NLPAI)</publisher>
<place>
<placeTerm type="text">AU-KBC Research Centre, Chennai, India</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>Traditional evaluation metrics like BLEU and ROUGE fall short when capturing the nuanced qualities of generated text, particularly when there is no single ground truth. In this paper, we explore the potential of Large Language Models (LLMs), specifically Google Gemini 1, to serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks. We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality annotators when compared with human judgments on the SummEval and USR datasets, asking the model to generate both a score as well as a justification for the score. Furthermore, we explore the robustness of the LLM evaluator by using perturbed inputs. Our findings suggest that while LLMs show promise, their alignment with human evaluators is limited, they are not robust against perturbations and significant improvements are required for their standalone use as reliable evaluators for subjective metrics.</abstract>
<identifier type="citekey">chaudhary-etal-2024-towards</identifier>
<location>
<url>https://aclanthology.org/2024.icon-1.22/</url>
</location>
<part>
<date>2024-12</date>
<extent unit="page">
<start>197</start>
<end>205</end>
</extent>
</part>
</mods>
</modsCollection>
Endnote
%0 Conference Proceedings
%T Towards Understanding the Robustness of LLM-based Evaluations under Perturbations
%A Chaudhary, Manav
%A Gupta, Harshit
%A Bhat, Savita
%A Varma, Vasudeva
%Y Lalitha Devi, Sobha
%Y Arora, Karunesh
%S Proceedings of the 21st International Conference on Natural Language Processing (ICON)
%D 2024
%8 December
%I NLP Association of India (NLPAI)
%C AU-KBC Research Centre, Chennai, India
%F chaudhary-etal-2024-towards
%X Traditional evaluation metrics like BLEU and ROUGE fall short when capturing the nuanced qualities of generated text, particularly when there is no single ground truth. In this paper, we explore the potential of Large Language Models (LLMs), specifically Google Gemini 1, to serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks. We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality annotators when compared with human judgments on the SummEval and USR datasets, asking the model to generate both a score as well as a justification for the score. Furthermore, we explore the robustness of the LLM evaluator by using perturbed inputs. Our findings suggest that while LLMs show promise, their alignment with human evaluators is limited, they are not robust against perturbations, and significant improvements are required for their standalone use as reliable evaluators for subjective metrics.
%U https://aclanthology.org/2024.icon-1.22/
%P 197-205
Markdown (Informal)
[Towards Understanding the Robustness of LLM-based Evaluations under Perturbations](https://aclanthology.org/2024.icon-1.22/) (Chaudhary et al., ICON 2024)
ACL
Manav Chaudhary, Harshit Gupta, Savita Bhat, and Vasudeva Varma. 2024. Towards Understanding the Robustness of LLM-based Evaluations under Perturbations. In Proceedings of the 21st International Conference on Natural Language Processing (ICON), pages 197–205, AU-KBC Research Centre, Chennai, India. NLP Association of India (NLPAI).