Can LLM be a Personalized Judge?

Yijiang River Dong; Tiancheng Hu; Nigel Collier

doi:10.18653/v1/2024.findings-emnlp.592

Abstract

As large language models (LLMs) gain widespread adoption, ensuring they cater to diverse user needs has become increasingly important. While many researchers have studied LLM personalization and role-playing, they primarily use LLM-as-a-Judge for evaluation without thoroughly examining its validity. This paper investigates the reliability of LLM-as-a-Personalized-Judge—asking LLMs to judge user preferences based on persona. Our results suggest that LLM-as-a-Personalized-Judge is less reliable for personalization than previously believed, showing low agreement with human ground truth. We observed that the personas provided to the LLM often have limited predictive power for the tasks, leading us to introduce verbal uncertainty estimation. We find that powerful LLMs are aware of the certainty of their prediction and can achieve high agreement with ground truth on high-certainty samples, indicating a promising approach for building reliable and scalable proxies for evaluating LLM personalization. Our human annotation reveals that third-person crowd worker evaluations of personalized preferences are even worse than LLM predictions, highlighting the challenges of evaluating LLM personalization.

Anthology ID: 2024.findings-emnlp.592
Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 10126–10141
URL: https://aclanthology.org/2024.findings-emnlp.592/
DOI: 10.18653/v1/2024.findings-emnlp.592
Cite (ACL): Yijiang River Dong, Tiancheng Hu, and Nigel Collier. 2024. Can LLM be a Personalized Judge?. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10126–10141, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Can LLM be a Personalized Judge? (Dong et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-emnlp.592.pdf
Software: 2024.findings-emnlp.592.software.zip
