Junfeng Zhan


2024

StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation
Boxi Cao | Mengjie Ren | Hongyu Lin | Xianpei Han | Feng Zhang | Junfeng Zhan | Le Sun
Findings of the Association for Computational Linguistics: ACL 2024

Evaluation is the baton for the development of large language models. Current evaluations typically employ a single-item assessment paradigm for each atomic test objective, which struggles to discern whether a model genuinely possesses the required capability or merely memorizes or guesses the answers to specific questions. To this end, this paper proposes a novel evaluation framework referred to as StructEval. Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a structured assessment across multiple cognitive levels and critical concepts, and therefore offers a comprehensive, robust, and consistent evaluation of large language models. Experiments on three widely used benchmarks demonstrate that StructEval serves as a reliable tool for resisting the risk of data contamination and reducing the interference of potential biases, thereby providing more reliable and consistent conclusions regarding model capabilities. Our framework also sheds light on the design of future principled and trustworthy LLM evaluation protocols.
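
To make the "deepen and broaden" idea concrete, the following is a minimal, hypothetical sketch of how one atomic test objective could be expanded into a structured set of items: deepened across cognitive levels and broadened across related concepts. The level names, the TestItem/StructuredEvalSet classes, and the generate_question stub are illustrative assumptions for this sketch, not the authors' implementation.

```python
# Hypothetical sketch: expand one atomic test objective into a structured set
# of items across cognitive levels (deepen) and related concepts (broaden).
# Names and level taxonomy are assumptions, not the StructEval codebase.
from dataclasses import dataclass, field

COGNITIVE_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]

@dataclass
class TestItem:
    objective: str   # the atomic test objective the item derives from
    concept: str     # the concept this particular question targets
    level: str       # cognitive level being probed
    question: str    # the generated question text

@dataclass
class StructuredEvalSet:
    seed_objective: str
    items: list = field(default_factory=list)

def generate_question(concept: str, level: str) -> str:
    """Placeholder for an LLM- or template-based question generator."""
    return f"[{level}] Question probing '{concept}'"

def expand_item(objective: str, related_concepts: list[str]) -> StructuredEvalSet:
    """Build a structured evaluation set from a single atomic objective."""
    eval_set = StructuredEvalSet(seed_objective=objective)
    for concept in [objective, *related_concepts]:   # broaden across concepts
        for level in COGNITIVE_LEVELS:               # deepen across levels
            eval_set.items.append(
                TestItem(objective, concept, level, generate_question(concept, level))
            )
    return eval_set

if __name__ == "__main__":
    es = expand_item("photosynthesis", ["chlorophyll", "light reaction"])
    print(len(es.items))  # 3 concepts x 6 levels = 18 structured items
```

Under this reading, a model would be scored on its behavior across the whole structured set rather than on a single seed question, which is presumably what makes memorized or guessed answers to individual items easier to detect.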