Zheng Yao
2025
Empirical Study on Data Attributes Insufficiency of Evaluation Benchmarks for LLMs
Chuang Liu | Renren Jin | Zheng Yao | Tianyi Li | Liang Cheng | Mark Steedman | Deyi Xiong
Proceedings of the 31st International Conference on Computational Linguistics
Previous benchmarks for evaluating large language models (LLMs) have primarily emphasized quantitative metrics, such as data volume. However, this focus may neglect key qualitative data attributes that can significantly impact the final rankings of LLMs, resulting in unreliable leaderboards. In this paper, we investigate whether current LLM benchmarks adequately consider these data attributes. We specifically examine three attributes: diversity, redundancy, and difficulty. To explore these attributes, we propose a framework with three separate modules, each designed to assess one of the attributes. Using a method that progressively incorporates these attributes, we analyze their influence on the benchmark. Our experimental results reveal a meaningful correlation between LLM rankings on the revised benchmark and the original benchmark when these attributes are accounted for. These findings indicate that existing benchmarks often fail to meet all three criteria, highlighting a lack of consideration for multifaceted data attributes in current evaluation datasets.
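As a rough illustration of the comparison the abstract describes, the sketch below computes the Spearman correlation between LLM rankings on an original benchmark and on a revised, attribute-aware version. It is a minimal sketch under assumed inputs: the model names, accuracy values, and the notion of an "attribute-filtered" revised benchmark are hypothetical placeholders, not the paper's actual modules, data, or results.

```python
# Minimal sketch (not the paper's framework): compare LLM rankings on an
# original benchmark with rankings on a revised benchmark whose items have
# been selected for diversity, redundancy, and difficulty.
# All model names and accuracy values are made up for illustration.
from scipy.stats import spearmanr


def ranking_correlation(original_acc, revised_acc):
    """Spearman rank correlation of per-model scores (1.0 = identical ranking)."""
    models = sorted(original_acc)
    original_scores = [original_acc[m] for m in models]
    revised_scores = [revised_acc[m] for m in models]
    rho, _ = spearmanr(original_scores, revised_scores)
    return rho


# Hypothetical accuracies on the full benchmark vs. an attribute-filtered subset.
original = {"model_a": 0.71, "model_b": 0.69, "model_c": 0.55}
revised = {"model_a": 0.62, "model_b": 0.68, "model_c": 0.57}

print(f"Spearman rho: {ranking_correlation(original, revised):.2f}")
```

Under such a comparison, a low rank correlation would indicate that the original leaderboard is sensitive to the neglected data attributes.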