OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety

Chuang Liu; Linhao Yu; Jiaxuan Li; Renren Jin; Yufei Huang; Ling Shi; Junhui Zhang; Xinmeng Ji; Tingting Cui; Tao Liu; Jinwang Song; Hongying Zan (昝红英); Sun Li; Deyi Xiong

doi:10.18653/v1/2024.acl-demos.19

Abstract

The rapid development of Chinese large language models (LLMs) poses significant challenges for efficient LLM evaluation. While current initiatives have introduced new benchmarks or evaluation platforms for assessing Chinese LLMs, many of these focus primarily on capabilities and usually overlook potential alignment and safety issues. To address this gap, we introduce OpenEval, an evaluation testbed that benchmarks Chinese LLMs across capability, alignment and safety. For capability assessment, we include 12 benchmark datasets to evaluate Chinese LLMs along 4 sub-dimensions: NLP tasks, disciplinary knowledge, commonsense reasoning and mathematical reasoning. For alignment assessment, OpenEval contains 7 datasets that examine bias, offensiveness and illegality in the outputs of Chinese LLMs. To evaluate safety, especially anticipated risks of advanced LLMs (e.g., power-seeking, self-awareness), we include 6 datasets. In addition to these benchmarks, we have implemented a phased public evaluation and benchmark update strategy so that OpenEval keeps pace with the development of Chinese LLMs and can provide cutting-edge benchmark datasets to guide that development. In our first public evaluation, we tested a range of Chinese LLMs, spanning from 7B to 72B parameters and including both open-source and proprietary models. Evaluation results indicate that while Chinese LLMs have shown impressive performance in certain tasks, more attention should be directed towards broader aspects such as commonsense reasoning, alignment, and safety.

Anthology ID: 2024.acl-demos.19
Volume: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Yixin Cao, Yang Feng, Deyi Xiong
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 190–210
URL: https://aclanthology.org/2024.acl-demos.19/
DOI: 10.18653/v1/2024.acl-demos.19
Cite (ACL): Chuang Liu, Linhao Yu, Jiaxuan Li, Renren Jin, Yufei Huang, Ling Shi, Junhui Zhang, Xinmeng Ji, Tingting Cui, Tao Liu, Jinwang Song, Hongying Zan, Sun Li, and Deyi Xiong. 2024. OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 190–210, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety (Liu et al., ACL 2024)
PDF: https://aclanthology.org/2024.acl-demos.19.pdf