Are Large Language Models Reliable Reviewers? A Benchmark for Error Detection in Financial Documents

Ying He; Zhouhong Gu; Zhecheng Hu; Yubo Zhou; Hao Shen; Jiaqing Liang; Zhaoqian Dai; Ma Shuguang; Fei Yu; Yanghua Xiao; Zhixu Li

Abstract

Ensuring the accuracy of financial documents is critical for economic analysis, regulatory compliance, and corporate decision-making. Several studies have shown that Large Language Models (LLMs) perform well on many financial tasks, such as stock price movement prediction and financial analytics. However, a critical task remains unexplored: the ability of LLMs to identify errors in financial documents. In this paper, we introduce **FinED-Bench**, the first publicly available benchmark for Financial Error Detection across three levels of cognitive complexity. FinED-Bench covers nine real-world financial scenarios and includes over 900 documents reported in 2025 that are unseen by existing language models. We detail the benchmark construction process and evaluate several advanced LLMs (e.g., GPT-4o, Qwen3-14B) on this task, which requires both financial domain knowledge and reasoning capabilities. Experimental results show that current LLMs still struggle with this task, especially in high-complexity cases. In addition, supervised fine-tuning can significantly improve the performance of weaker LLMs on this task. Our data and code are available at https://anonymous.4open.science/r/FinED-Bench-406F.

Anthology ID: 2026.findings-acl.1481
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 29625–29643
URL: https://aclanthology.org/2026.findings-acl.1481/
Cite (ACL): Ying He, Zhouhong Gu, Zhecheng Hu, Yubo Zhou, Hao Shen, Jiaqing Liang, Zhaoqian Dai, Ma Shuguang, Fei Yu, Yanghua Xiao, and Zhixu Li. 2026. Are Large Language Models Reliable Reviewers? A Benchmark for Error Detection in Financial Documents. In Findings of the Association for Computational Linguistics: ACL 2026, pages 29625–29643, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Are Large Language Models Reliable Reviewers? A Benchmark for Error Detection in Financial Documents (He et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1481.pdf
Checklist: 2026.findings-acl.1481.checklist.pdf
