Xingjian Zhao
2024
L-Eval: Instituting Standardized Evaluation for Long Context Language Models
Chenxin An | Shansan Gong | Ming Zhong | Xingjian Zhao | Mukai Li | Jun Zhang | Lingpeng Kong | Xipeng Qiu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recently, there has been growing interest in long-context scaling of large language models (LLMs). To facilitate research in this field, we propose L-Eval to institute a more standardized evaluation for Long-Context Language Models (LCLMs), addressing two key aspects: dataset construction and evaluation metrics. On the one hand, we build a new evaluation suite containing 20 sub-tasks, 508 long documents, and more than 2,000 human-labeled query-response pairs, covering diverse task types, domains, and input lengths (3k to 200k tokens). On the other hand, we investigate the effectiveness of evaluation metrics for LCLMs and show that Length-instruction-enhanced (LIE) evaluation and LLM judges correlate better with human judgments. We conduct a comprehensive study of 4 popular commercial LLMs and 12 open-source counterparts using the L-Eval benchmark. Our empirical findings offer useful insights into the study of LCLMs and lay the groundwork for the development of a more principled evaluation of these models.