Ziyu Zhuang
2023
Through the Lens of Core Competency: Survey on Evaluation of Large Language Models
Ziyu Zhuang | Qiguang Chen | Longxuan Ma | Mingda Li | Yi Han | Yushan Qian | Haopeng Bai | Weinan Zhang | Ting Liu
Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 2: Frontier Forum)
“From pre-trained language models (PLMs) to large language models (LLMs), the field of natural language processing (NLP) has witnessed steep performance gains and wide practical uses. The evaluation of a research field guides its direction of improvement. However, LLMs are extremely hard to thoroughly evaluate for two reasons. First, traditional NLP tasks become inadequate due to the excellent performance of LLMs. Second, existing evaluation tasks struggle to keep up with the wide range of applications in real-world scenarios. To tackle these problems, existing works have proposed various benchmarks to better evaluate LLMs. To clarify the numerous evaluation tasks in both academia and industry, we investigate multiple papers concerning LLM evaluations. We summarize four core competencies of LLMs: reasoning, knowledge, reliability, and safety. For every competency, we introduce its definition, corresponding benchmarks, and metrics. Under this competency architecture, similar tasks are combined to reflect the corresponding ability, while new tasks can also be easily added into the system. Finally, we give our suggestions on the future direction of LLM evaluation.”
2022
SelF-Eval: Self-supervised Fine-grained Dialogue Evaluation
Longxuan Ma | Ziyu Zhuang | Weinan Zhang | Mingda Li | Ting Liu
Proceedings of the 29th International Conference on Computational Linguistics
This paper introduces a novel Self-supervised Fine-grained Dialogue Evaluation framework (SelF-Eval). The core idea is to model the correlation between turn quality and the entire dialogue quality. We first propose a novel automatic data construction method that can automatically assign fine-grained scores to arbitrary dialogue data. Then we train SelF-Eval with a multi-level contrastive learning schema that helps distinguish different score levels. Experimental results on multiple benchmarks show that SelF-Eval is highly consistent with human evaluations and outperforms state-of-the-art models. We give a detailed analysis of the experiments in this paper. Our code is available on GitHub.
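As a rough illustration of the multi-level contrastive idea mentioned in the abstract, the following is a minimal PyTorch sketch of one way such an objective could be written: for every pair of samples whose quality levels differ, the predicted score of the higher-level sample is pushed above the lower-level one by a margin that grows with the level gap. The function name, margin scheme, and tensor layout are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multi_level_contrastive_loss(scores: torch.Tensor,
                                 levels: torch.Tensor,
                                 margin: float = 0.1) -> torch.Tensor:
    """Hypothetical multi-level contrastive loss (a sketch, not SelF-Eval's
    exact objective).

    scores: (B,) predicted quality scores for a batch of dialogues
    levels: (B,) integer quality levels, higher = better quality
    """
    # Pairwise differences: entry [i, j] = value[j] - value[i]
    level_gap = levels.unsqueeze(0) - levels.unsqueeze(1)   # (B, B)
    score_gap = scores.unsqueeze(0) - scores.unsqueeze(1)   # (B, B)

    # Only compare pairs where sample j sits at a strictly higher level than i
    mask = level_gap > 0

    # Hinge penalty: require score_gap >= margin * level_gap, so samples
    # separated by more quality levels must also be separated by larger scores
    loss = F.relu(margin * level_gap[mask].float() - score_gap[mask])
    return loss.mean() if mask.any() else scores.new_zeros(())

# Usage example with a toy batch of four dialogues at three quality levels
scores = torch.tensor([0.2, 0.5, 0.4, 0.9], requires_grad=True)
levels = torch.tensor([0, 1, 1, 2])
print(multi_level_contrastive_loss(scores, levels))
```

The margin scaled by the level gap is what makes the loss "multi-level" rather than a plain pairwise ranking loss: adjacent score levels only need a small separation, while distant levels must be pushed further apart.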