Wenhong Zhu


2024

CLEANEVAL: Clean Evaluation on Contaminated Large Language Models
Wenhong Zhu | Hongkun Hao | Zhiwei He | Yun-Ze Song | Jiao Yueyang | Yumeng Zhang | Hanxu Hu | Yiran Wei | Rui Wang | Hongyuan Lu
Findings of the Association for Computational Linguistics: NAACL 2024

We are currently in an era of fierce competition among large language models (LLMs), which continuously push the boundaries of benchmark performance. However, genuinely assessing the capabilities of these LLMs has become a challenging and critical issue due to potential data contamination. In this paper, we propose a novel and valuable method, Clean-Eval, which mitigates the issue of data contamination and evaluates LLMs more cleanly. Clean-Eval employs a neural model to paraphrase and back-translate the contaminated data into a candidate set, generating expressions with the same meaning but different surface forms. A semantic detector then filters out low-quality generated samples to narrow down this candidate set. Candidates with moderate BLEURT scores against the original samples are selected as the final evaluation set. According to human assessment, this set is almost semantically equivalent to the original contamination set but expressed differently. We conduct experiments on 20 existing benchmarks across diverse tasks, and the results demonstrate that Clean-Eval substantially restores the actual evaluation results on contaminated LLMs under both few-shot learning and fine-tuning scenarios.
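
The filter-then-select step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `select_candidate`, the `low`/`high` band that stands in for "moderate" scores, and the rule of picking the candidate closest to the band's midpoint are all assumptions. The toy `jaccard` word-overlap score and `toy_detector` are hypothetical stand-ins for BLEURT and the semantic detector, used only so the example runs without external models.

```python
from typing import Callable, List, Optional


def select_candidate(
    original: str,
    candidates: List[str],
    is_equivalent: Callable[[str, str], bool],
    similarity: Callable[[str, str], float],
    low: float = 0.4,   # assumed lower bound of the "moderate" band
    high: float = 0.8,  # assumed upper bound of the "moderate" band
) -> Optional[str]:
    """Pick one rewritten sample, or None if no candidate qualifies."""
    # Step 1: drop candidates that the semantic detector rejects.
    kept = [c for c in candidates if is_equivalent(original, c)]
    # Step 2: keep candidates whose similarity to the original is moderate,
    # i.e. neither near-copies nor too far from the original.
    scored = [(similarity(original, c), c) for c in kept]
    in_band = [(s, c) for s, c in scored if low <= s <= high]
    if not in_band:
        return None
    # Step 3 (assumption): prefer the score closest to the band's midpoint.
    mid = (low + high) / 2
    return min(in_band, key=lambda pair: abs(pair[0] - mid))[1]


def jaccard(a: str, b: str) -> float:
    """Toy word-overlap score; a hypothetical stand-in for BLEURT."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)


def toy_detector(a: str, b: str) -> bool:
    """Toy stand-in for a semantic detector: requires some shared words."""
    return jaccard(a, b) > 0.0


original = "the cat sat on the mat"
candidates = [
    "the cat sat on the mat",          # identical, so its score is too high
    "a cat sat on a mat",
    "dogs bark loudly",                # rejected by the toy detector
    "the cat was sitting on the mat",
]
chosen = select_candidate(original, candidates, toy_detector, jaccard)
print(chosen)
```

In practice the two callables would wrap real models, and the band would need tuning per benchmark; the values above are placeholders.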

2023

Penalty Decoding: Well Suppress the Self-Reinforcement Effect in Open-Ended Text Generation
Wenhong Zhu | Hongkun Hao | Rui Wang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The decoding algorithm is critical for open-ended text generation, transforming latent representations into coherent and meaningful outputs. This paper investigates the self-reinforcement effect in text generation and the effectiveness of a repetition penalty in mitigating it. However, determining the optimal repetition penalty value is challenging. To tackle this, we propose a forgetting mechanism that disregards distant tokens, reducing the burden of penalty selection. In addition, we introduce a length penalty to address overly short sentences caused by excessive penalties. Our penalty decoding approach, which combines these three strategies, helps resolve issues with sampling methods deviating from factual information. Experimental results demonstrate the efficacy of our approach in generating high-quality sentences resembling human output.
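
A repetition penalty combined with a forgetting mechanism might look like the sketch below. It is an illustration under stated assumptions rather than the paper's exact formulation: the function name `penalize_recent_tokens`, the `window` parameter, and the divide-or-multiply form of the penalty (a common way to lower a logit regardless of its sign) are choices made here, and the length penalty is not shown.

```python
from typing import List, Sequence


def penalize_recent_tokens(
    logits: Sequence[float],
    generated: Sequence[int],
    penalty: float = 1.5,  # assumed value; 1.0 means no penalty
    window: int = 3,       # assumed size of the "memory" of past tokens
) -> List[float]:
    """Return a copy of `logits` with recently generated token ids penalized."""
    if penalty < 1.0:
        raise ValueError("penalty must be >= 1.0")
    out = list(logits)
    # Forgetting: only the last `window` tokens are eligible for the penalty;
    # tokens generated earlier than that are treated as never seen.
    recent = set(generated[-window:]) if window > 0 else set()
    for token_id in recent:
        score = out[token_id]
        # Dividing a positive logit or multiplying a negative one both lower it.
        out[token_id] = score / penalty if score > 0 else score * penalty
    return out


logits = [2.0, -1.0, 0.5, 3.0]
history = [3, 0, 1, 1]  # token 3 falls outside a window of 3
adjusted = penalize_recent_tokens(logits, history, penalty=2.0, window=3)
print(adjusted)  # [1.0, -2.0, 0.5, 3.0]
```

Token 3 keeps its original logit because it lies outside the window, which is the effect the forgetting mechanism is meant to have: a token used long ago can be chosen again without being penalized.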