2024
FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models
Zhuohao Yu | Chang Gao | Wenjin Yao | Yidong Wang | Zhengran Zeng | Wei Ye | Jindong Wang | Yue Zhang | Shikun Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
The rapid growth of evaluation methodologies and datasets for large language models (LLMs) has created a pressing need for their unified integration. Meanwhile, concerns about data contamination and bias compromise the trustworthiness of evaluation findings, while the efficiency of evaluation processes remains a bottleneck due to the significant computational costs associated with LLM inference. In response to these challenges, we introduce FreeEval, a modular framework that not only conducts trustworthy and efficient automatic evaluations of LLMs but also serves as a platform for developing and validating new evaluation methodologies. FreeEval addresses key challenges through: (1) unified abstractions that simplify the integration of diverse evaluation methods, including dynamic evaluations requiring complex LLM interactions; (2) built-in meta-evaluation techniques such as data contamination detection and human evaluation to enhance result fairness; (3) a high-performance infrastructure with distributed computation and caching strategies for efficient large-scale evaluations; and (4) an interactive Visualizer for result analysis and interpretation to support innovation of evaluation techniques. We open-source all our code at https://github.com/WisdomShell/FreeEval, and our demonstration video, live demo, and installation guides are available at https://freeeval.zhuohao.me/.
PURE: Aligning LLM via Pluggable Query Reformulation for Enhanced Helpfulness
Wenjin Yao | Yidong Wang | Zhuohao Yu | Rui Xie | Shikun Zhang | Wei Ye
Findings of the Association for Computational Linguistics: EMNLP 2024
Aligning large language models (LLMs) with human values and preferences is a significant challenge. Training-based methods, such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), require substantial resources and are impractical for API-based LLMs. Post-processing methods decouple alignment from training but may incur high costs from multiple inference passes or rely on less knowledgeable lightweight models for response refinement. In this paper, we propose a new LLM alignment paradigm from the perspective of pre-processing. By reformulating risky queries into highly relevant yet harmless ones before feeding them into LLMs, our method eliminates the high costs of training base LLMs, efficiently applies to both open-source and proprietary LLMs, and achieves a promising balance of harmlessness and helpfulness. For example, with Vicuna-7B as the LLM to align, it enhances helpfulness by 28.52% over DPO while maintaining comparable harmlessness levels. When applied to Gemini-1.5-pro, it increases harmlessness and helpfulness by 7.04% and 29.37%, respectively.
KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
Zhuohao Yu | Chang Gao | Wenjin Yao | Yidong Wang | Wei Ye | Jindong Wang | Xing Xie | Yue Zhang | Shikun Zhang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Automatic evaluation methods for large language models (LLMs) are hindered by data contamination, leading to inflated assessments of their effectiveness. Existing strategies, which aim to detect contaminated texts, focus on quantifying contamination status instead of accurately gauging model performance. In this paper, we introduce KIEval, a Knowledge-grounded Interactive Evaluation framework, which incorporates an LLM-powered “interactor” role for the first time to accomplish a dynamic contamination-resilient evaluation. Starting with a question in a conventional LLM benchmark involving domain-specific knowledge, KIEval utilizes dynamically generated, multi-round, and knowledge-focused dialogues to determine whether a model’s response is merely a recall of benchmark answers or demonstrates a deep enough comprehension to apply knowledge in more complex conversations. Extensive experiments on seven leading LLMs across five datasets validate KIEval’s effectiveness and generalization. We also reveal that data contamination contributes nothing, or is even detrimental, to models’ real-world applicability and understanding, and that existing contamination detection methods for LLMs can identify contamination only in pre-training, not in supervised fine-tuning.