Human-Centered Design Recommendations for LLM-as-a-judge

Qian Pan, Zahra Ashktorab, Michael Desmond, Martín Santillán Cooper, James Johnson, Rahul Nair, Elizabeth Daly, Werner Geyer


Abstract
Traditional reference-based metrics, such as BLEU and ROUGE, are less effective for assessing outputs from Large Language Models (LLMs) that produce highly creative or superior-quality text, or in situations where reference outputs are unavailable. While human evaluation remains an option, it is costly and difficult to scale. Recent work using LLMs as evaluators (LLM-as-a-judge) is promising, but trust and reliability remain significant concerns. Integrating human input is crucial to ensure that the evaluation criteria are aligned with the human’s intent and that evaluations are robust and consistent. This paper presents a user study of a design exploration called EvaluLLM, which enables users to leverage LLMs as customizable judges, promoting human involvement to balance trust and cost-saving potential with caution. Through interviews with eight domain experts, we identified the need for assistance in developing effective evaluation criteria that align the LLM-as-a-judge with practitioners’ preferences and expectations. We offer findings and design recommendations to optimize human-assisted LLM-as-a-judge systems.
Anthology ID:
2024.hucllm-1.2
Volume:
Proceedings of the 1st Human-Centered Large Language Modeling Workshop
Month:
August
Year:
2024
Address:
TBD
Editors:
Nikita Soni, Lucie Flek, Ashish Sharma, Diyi Yang, Sara Hooker, H. Andrew Schwartz
Venues:
HuCLLM | WS
Publisher:
ACL
Pages:
16–29
URL:
https://aclanthology.org/2024.hucllm-1.2
Cite (ACL):
Qian Pan, Zahra Ashktorab, Michael Desmond, Martín Santillán Cooper, James Johnson, Rahul Nair, Elizabeth Daly, and Werner Geyer. 2024. Human-Centered Design Recommendations for LLM-as-a-judge. In Proceedings of the 1st Human-Centered Large Language Modeling Workshop, pages 16–29, TBD. ACL.
Cite (Informal):
Human-Centered Design Recommendations for LLM-as-a-judge (Pan et al., HuCLLM-WS 2024)
PDF:
https://aclanthology.org/2024.hucllm-1.2.pdf