CLEAN–EVAL: Clean Evaluation on Contaminated Large Language Models

Wenhong Zhu, Hongkun Hao, Zhiwei He, Yun-Ze Song, Jiao Yueyang, Yumeng Zhang, Hanxu Hu, Yiran Wei, Rui Wang, Hongyuan Lu


Abstract
We are currently in an era of fierce competition among large language models (LLMs), which continuously push the boundaries of benchmark performance. However, genuinely assessing the capabilities of these LLMs has become a challenging and critical issue due to potential data contamination. In this paper, we propose Clean-Eval, a novel and valuable method that mitigates data contamination and evaluates LLMs more cleanly. Clean-Eval employs a neural-based model to paraphrase and back-translate contaminated data into a candidate set, generating expressions with the same meaning but different surface forms. A semantic detector then filters out low-quality generations to narrow down the candidate set, and candidates with moderate BLEURT scores against the original samples are selected as the final evaluation set. According to human assessment, this set is almost semantically equivalent to the original contaminated set but expressed differently. We conduct experiments on 20 existing benchmarks across diverse tasks; the results demonstrate that Clean-Eval substantially restores the true evaluation results of contaminated LLMs under both few-shot learning and fine-tuning scenarios.
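The abstract outlines a three-step pipeline: generate paraphrased and back-translated variants, filter them with a semantic detector, then keep candidates with moderate BLEURT scores. The sketch below illustrates one plausible reading of that loop; it is not the paper's implementation. The paraphrase, back-translation, semantic-detector, and BLEURT models are passed in as callables, and the similarity threshold (0.8) and the "moderate" BLEURT band (0.4–0.7) are assumed placeholder values, not numbers taken from the paper.

```python
# Minimal sketch of a Clean-Eval-style candidate selection loop.
# All models are supplied as callables; thresholds are illustrative
# assumptions, not values from the paper.
from typing import Callable, List, Tuple


def build_clean_eval_set(
    contaminated_samples: List[str],
    paraphrase: Callable[[str], List[str]],       # e.g. an LLM prompted to paraphrase
    back_translate: Callable[[str], List[str]],   # e.g. an en->X->en round trip
    semantic_score: Callable[[str, str], float],  # semantic detector; higher = closer
    bleurt_score: Callable[[str, str], float],    # BLEURT against the original sample
    min_semantic: float = 0.8,                    # assumed filter threshold
    bleurt_band: Tuple[float, float] = (0.4, 0.7),  # assumed "moderate" range
) -> List[str]:
    clean_set = []
    lo, hi = bleurt_band
    mid = (lo + hi) / 2
    for original in contaminated_samples:
        # 1) Generate variants with the same meaning but different surface forms.
        candidates = paraphrase(original) + back_translate(original)
        # 2) Filter out low-quality generations with the semantic detector.
        candidates = [c for c in candidates
                      if semantic_score(original, c) >= min_semantic]
        # 3) Keep only candidates whose BLEURT score against the original is
        #    moderate: different enough in form, close enough in meaning.
        moderate = [c for c in candidates
                    if lo <= bleurt_score(original, c) <= hi]
        if moderate:
            # Pick the candidate closest to the middle of the band.
            clean_set.append(
                min(moderate, key=lambda c: abs(bleurt_score(original, c) - mid)))
    return clean_set
```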
Anthology ID:
2024.findings-naacl.53
Volume:
Findings of the Association for Computational Linguistics: NAACL 2024
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
835–847
URL:
https://aclanthology.org/2024.findings-naacl.53
DOI:
10.18653/v1/2024.findings-naacl.53
Cite (ACL):
Wenhong Zhu, Hongkun Hao, Zhiwei He, Yun-Ze Song, Jiao Yueyang, Yumeng Zhang, Hanxu Hu, Yiran Wei, Rui Wang, and Hongyuan Lu. 2024. CLEAN–EVAL: Clean Evaluation on Contaminated Large Language Models. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 835–847, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
CLEAN–EVAL: Clean Evaluation on Contaminated Large Language Models (Zhu et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-naacl.53.pdf