StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation
Boxi Cao | Mengjie Ren | Hongyu Lin | Xianpei Han | Feng Zhang | Junfeng Zhan | Le Sun
Findings of the Association for Computational Linguistics: ACL 2024
Evaluation is the baton for the development of large language models. Current evaluations typically employ a single-item assessment paradigm for each atomic test objective, which struggles to discern whether a model genuinely possesses the required capabilities or merely memorizes or guesses the answers to specific questions. To this end, this paper proposes a novel evaluation framework referred to as StructEval. Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a structured assessment across multiple cognitive levels and critical concepts, and therefore offers comprehensive, robust, and consistent evaluations for large language models. Experiments on three widely used benchmarks demonstrate that StructEval serves as a reliable tool for resisting the risk of data contamination and reducing the interference of potential biases, thereby providing more reliable and consistent conclusions regarding model capabilities. Our framework also sheds light on the design of future principled and trustworthy LLM evaluation protocols.