Yijiang River Dong
2024
Can LLM be a Personalized Judge?
Yijiang River Dong | Tiancheng Hu | Nigel Collier
Findings of the Association for Computational Linguistics: EMNLP 2024
As large language models (LLMs) gain widespread adoption, ensuring they cater to diverse user needs has become increasingly important. While many researchers have studied LLM personalization and role-playing, they primarily use LLM-as-a-Judge for evaluation without thoroughly examining its validity. This paper investigates the reliability of LLM-as-a-Personalized-Judge—asking LLMs to judge user preferences based on persona. Our results suggest that LLM-as-a-Personalized-Judge is less reliable for personalization than previously believed, showing low agreement with human ground truth. We observed that the personas provided to the LLM often have limited predictive power for the tasks, leading us to introduce verbal uncertainty estimation. We find that powerful LLMs are aware of the certainty of their prediction and can achieve high agreement with ground truth on high-certainty samples, indicating a promising approach for building reliable and scalable proxies for evaluating LLM personalization. Our human annotation reveals that third-person crowd worker evaluations of personalized preferences are even worse than LLM predictions, highlighting the challenges of evaluating LLM personalization.
2023
CoRRPUS: Code-based Structured Prompting for Neurosymbolic Story Understanding
Yijiang River Dong | Lara J. Martin | Chris Callison-Burch
Findings of the Association for Computational Linguistics: ACL 2023
Story generation and understanding—as with all NLG/NLU tasks—has seen a surge in neurosymbolic work. Researchers have recognized that, while large language models (LLMs) have tremendous utility, they can be augmented with symbolic means to be even better and to make up for many flaws that neural networks have. However, symbolic methods are extremely costly in terms of the amount of time and expertise needed to create them. In this work, we capitalize on state-of-the-art Code-LLMs, such as Codex, to bootstrap the use of symbolic methods for tracking the state of stories and aiding in story understanding. We show that our CoRRPUS system and abstracted prompting procedures can beat current state-of-the-art structured LLM techniques on pre-existing story understanding tasks (bAbI Task 2 and Re³) with minimal hand engineering. This work highlights the usefulness of code-based symbolic representations for enabling LLMs to better perform story reasoning tasks.