Empirical Study on Data Attributes Insufficiency of Evaluation Benchmarks for LLMs

Chuang Liu, Renren Jin, Zheng Yao, Tianyi Li, Liang Cheng, Mark Steedman, Deyi Xiong

Abstract
Previous benchmarks for evaluating large language models (LLMs) have primarily emphasized quantitative metrics, such as data volume. However, this focus may neglect key qualitative data attributes that can significantly impact the final rankings of LLMs, resulting in unreliable leaderboards. In this paper, we investigate whether current LLM benchmarks adequately consider these data attributes. We specifically examine three attributes: diversity, redundancy, and difficulty. To explore these attributes, we propose a framework with three separate modules, each designed to assess one of the attributes. Using a method that progressively incorporates these attributes, we analyze their influence on the benchmark. Our experimental results reveal a meaningful correlation between LLM rankings on the revised benchmark and the original benchmark when these attributes are accounted for. These findings indicate that existing benchmarks often fail to meet all three criteria, highlighting a lack of consideration for multifaceted data attributes in current evaluation datasets.
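The abstract's central comparison, how stable an LLM leaderboard remains after a benchmark is revised for diversity, redundancy, and difficulty, can be illustrated with a small sketch using the Spearman rank correlation between the two rankings. All model names and scores below are invented for illustration; they are not taken from the paper, and the authors' actual attribute modules and data may differ.

```python
# Sketch: compare LLM leaderboards before and after a hypothetical
# benchmark revision. Scores are made up for illustration only.

def rank(scores):
    """Map each score to its rank (1 = highest), assuming no ties."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]

def spearman(xs, ys):
    """Spearman rank correlation for tie-free score lists."""
    n = len(xs)
    rx, ry = rank(xs), rank(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical accuracies on the original and the revised benchmark.
original = {"model_a": 71.2, "model_b": 68.5, "model_c": 66.0, "model_d": 60.3}
revised  = {"model_a": 64.0, "model_b": 65.1, "model_c": 58.2, "model_d": 55.0}

models = list(original)
rho = spearman([original[m] for m in models], [revised[m] for m in models])
print(f"Spearman rho between leaderboards: {rho:.2f}")  # → 0.80 here
```

A rho close to 1 would mean the revision barely reorders the models; a low rho would signal that the original benchmark's ranking was sensitive to the neglected data attributes, which is the kind of effect the paper investigates.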
Anthology ID: 2025.coling-main.403
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 6024–6038
URL: https://aclanthology.org/2025.coling-main.403/
Cite (ACL): Chuang Liu, Renren Jin, Zheng Yao, Tianyi Li, Liang Cheng, Mark Steedman, and Deyi Xiong. 2025. Empirical Study on Data Attributes Insufficiency of Evaluation Benchmarks for LLMs. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6024–6038, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Empirical Study on Data Attributes Insufficiency of Evaluation Benchmarks for LLMs (Liu et al., COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.403.pdf