Style Over Substance: Evaluation Biases for Large Language Models

Minghao Wu, Alham Fikri Aji


Abstract
As large language models (LLMs) continue to advance, accurately and comprehensively evaluating their performance becomes increasingly challenging. Ranking the relative performance of LLMs based on Elo ratings, according to human or LLM judgment, is gaining popularity. However, the extent to which humans and LLMs are capable evaluators remains uncertain. This study investigates the behavior of crowd-sourced and expert annotators, as well as LLMs, when comparing outputs from different models. To achieve this, we curate a dataset of intentionally flawed, machine-generated answers. Our findings reveal a concerning bias in the evaluation process: answers with factual errors are rated more favorably than answers that are too short or contain grammatical errors. To address this issue, we propose independently evaluating machine-generated text across multiple dimensions, rather than merging all evaluation aspects into a single score. We instantiate this idea with the Elo rating system, resulting in the Multi-Elo Rating System (MERS). Empirical results from our study reveal that this proposed approach significantly enhances the quality of LLM-based evaluations, particularly in terms of factual accuracy. However, we observe no significant improvement in crowd-sourced evaluations, indicating the need for further investigation.
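The underlying mechanism is the standard Elo update, applied once per evaluation dimension rather than once per overall comparison. Below is a minimal sketch of that idea; the dimension names, K-factor, and initial rating are illustrative assumptions, not necessarily the paper's exact configuration.

```python
from collections import defaultdict

K = 32              # update step size (illustrative choice)
INIT_RATING = 1000  # starting rating for every model (illustrative choice)

def expected_score(r_a, r_b):
    """Probability that a player rated r_a beats a player rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a):
    """Standard Elo update; score_a is 1 (A wins), 0 (B wins), or 0.5 (tie)."""
    e_a = expected_score(r_a, r_b)
    return r_a + K * (score_a - e_a), r_b + K * ((1 - score_a) - (1 - e_a))

# Multi-dimensional variant: keep a separate rating table per dimension and
# update each one from that dimension's pairwise judgment independently.
DIMENSIONS = ["accuracy", "helpfulness", "language"]  # hypothetical dimension names
ratings = {dim: defaultdict(lambda: INIT_RATING) for dim in DIMENSIONS}

def record_comparison(model_a, model_b, per_dim_scores):
    """per_dim_scores maps each dimension to 1, 0, or 0.5 from model_a's perspective."""
    for dim, score_a in per_dim_scores.items():
        r_a, r_b = ratings[dim][model_a], ratings[dim][model_b]
        ratings[dim][model_a], ratings[dim][model_b] = elo_update(r_a, r_b, score_a)

# Example: an annotator judges model_x more accurate but model_y better written.
record_comparison("model_x", "model_y",
                  {"accuracy": 1.0, "helpfulness": 0.5, "language": 0.0})
```

Keeping a separate leaderboard per dimension means a stylistic win no longer masks a factual loss, which is the failure mode a single merged score cannot distinguish.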
Anthology ID:
2025.coling-main.21
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
297–312
URL:
https://aclanthology.org/2025.coling-main.21/
Cite (ACL):
Minghao Wu and Alham Fikri Aji. 2025. Style Over Substance: Evaluation Biases for Large Language Models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 297–312, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Style Over Substance: Evaluation Biases for Large Language Models (Wu & Aji, COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.21.pdf