FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models

Xin Guo; Haotian Xia; Zhaowei Liu; Hanyang Cao; Zhi Yang; Zhiqiang Liu (刘志强); Sizhe Wang; Jinyi Niu; Chuqi Wang; Yanhui Wang; Xiaolong Liang; Xiaoming Huang; Bing Zhu; Zhongyu Wei; Yun Chen; Weining Shen; Liwen Zhang

doi:10.18653/v1/2025.naacl-long.318

Abstract

Large language models have demonstrated outstanding performance in various natural language processing tasks, but their security capabilities in the financial domain have not been explored, and their performance on complex tasks such as financial agent tasks remains unknown. This paper presents FinEval, a benchmark designed to evaluate LLMs’ financial domain knowledge and practical abilities. The dataset contains 8,351 questions in four key areas: Financial Academic Knowledge, Financial Industry Knowledge, Financial Security Knowledge, and Financial Agent. Financial Academic Knowledge comprises 4,661 multiple-choice questions spanning 34 subjects such as finance and economics. Financial Industry Knowledge contains 1,434 questions covering practical scenarios like investment research. Financial Security Knowledge assesses models through 1,640 questions on topics like application security and cryptography. Financial Agent evaluates tool usage and complex reasoning with 616 questions. FinEval supports multiple evaluation settings, including zero-shot and five-shot prompting with chain-of-thought, and assesses model performance using both objective and subjective criteria. Our results show that Claude 3.5-Sonnet achieves the highest weighted average score of 72.9 across all financial domain categories under the zero-shot setting. Our work provides a comprehensive benchmark closely aligned with the Chinese financial domain. The data and the code are available at https://github.com/SUFE-AIFLMLab/FinEval.
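The abstract reports a weighted average across the four categories but does not say how the weights are defined. As a minimal sketch, assuming each category is weighted by its share of the 8,351 questions (an assumption, not something the abstract confirms), the arithmetic could look like the following. The names `CATEGORY_COUNTS`, `category_weights`, and `weighted_average` are hypothetical and are not taken from the FinEval code base; no per-category scores are given in the abstract, so none are included here.

```python
# Question counts per category, as stated in the abstract.
CATEGORY_COUNTS = {
    "Financial Academic Knowledge": 4661,
    "Financial Industry Knowledge": 1434,
    "Financial Security Knowledge": 1640,
    "Financial Agent": 616,
}


def category_weights(counts):
    """Return each category's share of the total question count.

    ASSUMPTION: weighting by question count; the abstract does not
    specify the actual weighting scheme.
    """
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def weighted_average(scores, counts):
    """Combine per-category scores into one count-weighted average."""
    weights = category_weights(counts)
    return sum(scores[name] * weights[name] for name in counts)


# Sanity check: the four category sizes add up to the stated total.
total_questions = sum(CATEGORY_COUNTS.values())
```

Under this assumption, Financial Academic Knowledge would carry a little over half of the total weight (4,661 of 8,351 questions), so a model's score on that category would dominate the overall figure.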

Anthology ID: 2025.naacl-long.318
Volume: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 6258–6292
URL: https://aclanthology.org/2025.naacl-long.318/
DOI: 10.18653/v1/2025.naacl-long.318
Cite (ACL): Xin Guo, Haotian Xia, Zhaowei Liu, Hanyang Cao, Zhi Yang, Zhiqiang Liu, Sizhe Wang, Jinyi Niu, Chuqi Wang, Yanhui Wang, Xiaolong Liang, Xiaoming Huang, Bing Zhu, Zhongyu Wei, Yun Chen, Weining Shen, and Liwen Zhang. 2025. FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6258–6292, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models (Guo et al., NAACL 2025)
PDF: https://aclanthology.org/2025.naacl-long.318.pdf
