Sirui Liang


2024

pdf bib
ZhuJiu-Knowledge: A Fairer Platform for Evaluating Multiple Knowledge Types in Large Language Models
Pengfan Du | Sirui Liang | Baoli Zhang | Pengfei Cao | Yubo Chen | Kang Liu | Jun Zhao
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations)

The swift advancement in large language models (LLMs) has heightened the importance of model evaluations. LLMs have acquired a substantial amount of knowledge, and evaluating the knowledge of these LLMs is crucial. To address this, we introduce the ZhuJiu-Knowledge benchmark which carefully considers the following factors: (1) For knowledge scope, we concentrate on three domains: commonsense knowledge, world knowledge, language knowledge, which comes from ATOMIC, Conceptnet, Wikidata, and Wordnet. (2) For data construction, to prevent data contamination, we utilize knowledge derived from corpora and knowledge graphs to formulate novel questions which are ensured not to appear in the training corpus. A multitude of prompts is purposefully devised to mitigate the impact of prompt design on evaluation and to further analyze the LLMs’ sensitivity to various prompts. (3) For evaluation criteria, we propose a novel voting methodology for assessing generative text, aligning the model’s evaluation with human preferences to reduce biases inherent in individual model assessments. We evaluate 14 current mainstream LLMs and conduct a comprehensive discussion and analysis of their results. The ZhuJiu-Knowledge benchmark and open-participation leaderboard are publicly released at http://zhujiu-knowledge.top and we also provide a demo video at https://youtu.be/QJp4qlEHVH8.