Permutation-Consensus Listwise Judging for Robust Factuality Evaluation

Tianyi Huang; Nathan Huang; Justin Tang; Wenqian Chen; Elsa Fan

Abstract

Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality evaluation, where several answers can look similarly polished while differing substantially in hallucination risk. We introduce PCFJudge, an inference-time method that reruns the same factuality-first listwise prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision. On RewardBench 2 Factuality, the final seven-permutation aggregate (K=7) improves top-1 selection accuracy from 86.00% to 91.33% with GPT-5.4 and from 86.33% to 89.67% with Claude Sonnet 4.6. These results suggest that candidate order can be a meaningful source of factuality-judging error and that marginalizing over this nuisance variation can improve the reliability of LLM evaluation.
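The consensus idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the aggregation rule, so this sketch assumes a plain mean of per-candidate scores and leaves out the rank and uncertainty signals. The `Judge` signature, the use of cyclic rotations as the K orderings, and the toy judge with a first-position bonus are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical judge interface: receives candidates in presentation order and
# returns one factuality score per position (higher means more factual).
Judge = Callable[[Sequence[str]], List[float]]


def permutation_consensus(
    candidates: Sequence[str], judge: Judge, k: int = 7
) -> Tuple[int, Dict[int, float]]:
    """Score the same candidate set under k orderings and average per candidate.

    Assumption: orderings are cyclic rotations, chosen so the example is
    deterministic; the paper's own permutation scheme is not given here.
    """
    n = len(candidates)
    totals = [0.0] * n
    for r in range(k):
        # order[j] is the original index of the candidate shown at position j.
        order = [(r + j) % n for j in range(n)]
        scores = judge([candidates[i] for i in order])
        # Map each positional score back to the candidate it belongs to.
        for position, original_index in enumerate(order):
            totals[original_index] += scores[position]
    mean_scores = {i: totals[i] / k for i in range(n)}
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores


# Toy stand-in for an LLM judge: true quality plus a bonus for the first slot.
QUALITY = {"A": 0.9, "B": 0.6, "C": 0.5, "D": 0.4}


def biased_toy_judge(ordered: Sequence[str]) -> List[float]:
    return [QUALITY[c] + (0.4 if pos == 0 else 0.0) for pos, c in enumerate(ordered)]


if __name__ == "__main__":
    presented = ["B", "A", "C", "D"]
    single = biased_toy_judge(presented)
    print("single pass picks:", presented[single.index(max(single))])
    best, means = permutation_consensus(presented, biased_toy_judge, k=7)
    print("consensus picks:", presented[best])
```

In this toy setup a single pass with "B" listed first selects "B" because of the position bonus, while averaging over seven rotations spreads the bonus across candidates and selects "A", the candidate with the highest underlying quality.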

Anthology ID: 2026.gem-main.58
Volume: Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues: GEM | WS
Publisher: Association for Computational Linguistics
Pages: 595–603
URL: https://aclanthology.org/2026.gem-main.58/
Cite (ACL): Tianyi Huang, Nathan Huang, Justin Tang, Wenqian Chen, and Elsa Fan. 2026. Permutation-Consensus Listwise Judging for Robust Factuality Evaluation. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 595–603, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Permutation-Consensus Listwise Judging for Robust Factuality Evaluation (Huang et al., GEM 2026)
PDF: https://aclanthology.org/2026.gem-main.58.pdf
