Hao Shen
2026
Are Large Language Models Reliable Reviewers? A Benchmark for Error Detection in Financial Documents
Ying He | Zhouhong Gu | Zhecheng Hu | Yubo Zhou | Hao Shen | Jiaqing Liang | Zhaoqian Dai | Ma Shuguang | Fei Yu | Yanghua Xiao | Zhixu Li
Findings of the Association for Computational Linguistics: ACL 2026
Ensuring the accuracy of financial documents is critical for economic analysis, regulatory compliance, and corporate decision-making. Several studies have shown that Large Language Models (LLMs) perform well in many financial tasks, such as predicting stock price movements and financial analytics. However, a critical task remains unexplored: the ability of LLMs to identify errors in financial documents. In this paper, we introduce **FinED-Bench**, the first publicly available Benchmark for Financial Error Detection across three levels of cognitive complexity. FinED-Bench covers nine real-world financial scenarios and includes over 900 documents reported in 2025 that are unseen by existing language models. We detail the benchmark construction process and evaluate several advanced LLMs (e.g., GPT-4o, Qwen3-14B) on this task, which requires both financial domain knowledge and reasoning capabilities. Experimental results show that current LLMs still struggle with this task, especially in high-complexity cases. In addition, supervised fine-tuning can significantly improve the performance of weaker LLMs on this task. Our data and code are available at https://anonymous.4open.science/r/FinED-Bench-406F.
PII-Bench: Evaluating Query-Aware Privacy Protection Systems
Hao Shen | Zhouhong Gu | Haokai Hong | Weili Han | Hongfeng Chai
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The widespread adoption of Large Language Models (LLMs) has raised significant privacy concerns regarding the exposure of personally identifiable information (PII) in user prompts. To address this challenge, we propose a query-unrelated PII masking strategy and introduce PII-Bench, the first comprehensive evaluation framework for assessing privacy protection systems. PII-Bench comprises 2,842 test samples across 7 PII types with 55 fine-grained subcategories, featuring diverse scenarios from single-subject descriptions to complex multi-party interactions. Each sample is carefully crafted with a user query, context description, and standard answer indicating query-relevant PII. Our empirical evaluation reveals that while current models perform adequately in basic PII detection, they show significant limitations in determining PII query relevance. Even advanced LLMs struggle with this task, particularly in handling complex multi-subject scenarios, indicating substantial room for improvement in achieving intelligent PII masking.