@inproceedings{zhao-etal-2025-language,
title = "Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks",
author = "Zhao, Justin and
Plaza-del-Arco, Flor Miriam and
Genchel, Benjamin and
Curry, Amanda Cercas",
editor = "Chiruzzo, Luis and
Ritter, Alan and
Wang, Lu",
booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
month = apr,
year = "2025",
address = "Albuquerque, New Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.naacl-long.617/",
doi = "10.18653/v1/2025.naacl-long.617",
pages = "12395--12450",
ISBN = "979-8-89176-189-6",
abstract = "As Large Language Models (LLMs) continue to evolve, evaluating them remains a persistent challenge. Many recent evaluations use LLMs as judges to score outputs from other LLMs, often relying on a single large model like GPT-4o. However, using a single LLM judge is prone to intra-model bias, and many tasks {--} such as those related to emotional intelligence, creative writing, and persuasiveness {--} may be too subjective for a single model to judge fairly. We introduce the Language Model Council (LMC), where a group of LLMs collaborate to create tests, respond to them, and evaluate each other{'}s responses to produce a ranking in a democratic fashion. Unlike previous approaches that focus on reducing cost or bias by using a panel of smaller models, our work examines the benefits and nuances of a fully inclusive LLM evaluation system. In a detailed case study on emotional intelligence, we deploy a council of 20 recent LLMs to rank each other on open-ended responses to interpersonal conflicts. Our results show that the LMC produces rankings that are more separable and more robust, and through a user study, we show that they are more consistent with human evaluations than any individual LLM judge. Using all LLMs for judging can be costly, however, so we use Monte Carlo simulations and hand-curated sub-councils to study hypothetical council compositions and discuss the value of the incremental LLM judge."
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="zhao-etal-2025-language">
<titleInfo>
<title>Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks</title>
</titleInfo>
<name type="personal">
<namePart type="given">Justin</namePart>
<namePart type="family">Zhao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Flor</namePart>
<namePart type="given">Miriam</namePart>
<namePart type="family">Plaza-del-Arco</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Benjamin</namePart>
<namePart type="family">Genchel</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Amanda</namePart>
<namePart type="given">Cercas</namePart>
<namePart type="family">Curry</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-04</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Luis</namePart>
<namePart type="family">Chiruzzo</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Alan</namePart>
<namePart type="family">Ritter</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lu</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Albuquerque, New Mexico</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-189-6</identifier>
</relatedItem>
<abstract>As Large Language Models (LLMs) continue to evolve, evaluating them remains a persistent challenge. Many recent evaluations use LLMs as judges to score outputs from other LLMs, often relying on a single large model like GPT-4o. However, using a single LLM judge is prone to intra-model bias, and many tasks – such as those related to emotional intelligence, creative writing, and persuasiveness – may be too subjective for a single model to judge fairly. We introduce the Language Model Council (LMC), where a group of LLMs collaborate to create tests, respond to them, and evaluate each other’s responses to produce a ranking in a democratic fashion. Unlike previous approaches that focus on reducing cost or bias by using a panel of smaller models, our work examines the benefits and nuances of a fully inclusive LLM evaluation system. In a detailed case study on emotional intelligence, we deploy a council of 20 recent LLMs to rank each other on open-ended responses to interpersonal conflicts. Our results show that the LMC produces rankings that are more separable and more robust, and through a user study, we show that they are more consistent with human evaluations than any individual LLM judge. Using all LLMs for judging can be costly, however, so we use Monte Carlo simulations and hand-curated sub-councils to study hypothetical council compositions and discuss the value of the incremental LLM judge.</abstract>
<identifier type="citekey">zhao-etal-2025-language</identifier>
<identifier type="doi">10.18653/v1/2025.naacl-long.617</identifier>
<location>
<url>https://aclanthology.org/2025.naacl-long.617/</url>
</location>
<part>
<date>2025-04</date>
<extent unit="page">
<start>12395</start>
<end>12450</end>
</extent>
</part>
</mods>
</modsCollection>

%0 Conference Proceedings
%T Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks
%A Zhao, Justin
%A Plaza-del-Arco, Flor Miriam
%A Genchel, Benjamin
%A Curry, Amanda Cercas
%Y Chiruzzo, Luis
%Y Ritter, Alan
%Y Wang, Lu
%S Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
%D 2025
%8 April
%I Association for Computational Linguistics
%C Albuquerque, New Mexico
%@ 979-8-89176-189-6
%F zhao-etal-2025-language
%X As Large Language Models (LLMs) continue to evolve, evaluating them remains a persistent challenge. Many recent evaluations use LLMs as judges to score outputs from other LLMs, often relying on a single large model like GPT-4o. However, using a single LLM judge is prone to intra-model bias, and many tasks – such as those related to emotional intelligence, creative writing, and persuasiveness – may be too subjective for a single model to judge fairly. We introduce the Language Model Council (LMC), where a group of LLMs collaborate to create tests, respond to them, and evaluate each other’s responses to produce a ranking in a democratic fashion. Unlike previous approaches that focus on reducing cost or bias by using a panel of smaller models, our work examines the benefits and nuances of a fully inclusive LLM evaluation system. In a detailed case study on emotional intelligence, we deploy a council of 20 recent LLMs to rank each other on open-ended responses to interpersonal conflicts. Our results show that the LMC produces rankings that are more separable and more robust, and through a user study, we show that they are more consistent with human evaluations than any individual LLM judge. Using all LLMs for judging can be costly, however, so we use Monte Carlo simulations and hand-curated sub-councils to study hypothetical council compositions and discuss the value of the incremental LLM judge.
%R 10.18653/v1/2025.naacl-long.617
%U https://aclanthology.org/2025.naacl-long.617/
%U https://doi.org/10.18653/v1/2025.naacl-long.617
%P 12395-12450

Markdown (Informal)
[Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks](https://aclanthology.org/2025.naacl-long.617/) (Zhao et al., NAACL 2025)

ACL
Justin Zhao, Flor Miriam Plaza-del-Arco, Benjamin Genchel, and Amanda Cercas Curry. 2025. Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 12395–12450, Albuquerque, New Mexico. Association for Computational Linguistics.