A Multistage Extraction Pipeline for Long Scanned Financial Documents: An Empirical Study in Industrial KYC Workflows

Han Yuxuan; Yuanxing Zhang; Yushuo Wang; Yichao Jin

Abstract

Structured information extraction from long, multilingual scanned financial documents is a core requirement in industrial KYC and compliance workflows. These documents are typically non-machine-readable, noisy, and visually heterogeneous. They usually span dozens of pages while containing only sparse task-relevant information. Although recent vision–language models (VLMs) achieve strong benchmark performance, directly applying them end-to-end to full financial reports often leads to unreliable extraction under real-world conditions. We present a multistage extraction framework that integrates image preprocessing, multilingual OCR, hybrid page-level retrieval, and compact VLM-based structured extraction. The design separates page localization from multimodal reasoning, enabling more accurate extraction from complex multi-page documents. We evaluated the framework on 120 production KYC documents comprising about 3000 multilingual scanned pages. Across multiple OCR–VLM combinations, the proposed pipeline consistently outperforms direct PDF-to-VLM baselines, improving field-level accuracy by up to 31.9 percentage points. The best configuration, PaddleOCR with MiniCPM-o-2.6, achieves 87.27% accuracy. Ablation studies show that page-level retrieval is the dominant factor in performance improvements, particularly for complex financial statements and non-English documents.
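
The abstract does not say how the hybrid page-level retrieval stage scores pages, so the following is only a minimal sketch of the general idea of localizing pages before any multimodal reasoning. It assumes that "hybrid" means mixing a lexical keyword score with a similarity score. The function names, the `alpha` weight, the bag-of-words cosine (a stand-in for a real embedding model), and the sample pages are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a stand-in for real multilingual tokenization."""
    return re.findall(r"\w+", text.lower())


def keyword_score(page_tokens, keywords):
    """Fraction of the field's keywords that occur on the page."""
    if not keywords:
        return 0.0
    present = set(page_tokens)
    return sum(1 for k in keywords if k in present) / len(keywords)


def cosine_score(page_tokens, query_tokens):
    """Bag-of-words cosine similarity (assumed stand-in for dense embeddings)."""
    a, b = Counter(page_tokens), Counter(query_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_pages(ocr_pages, query, keywords, top_k=2, alpha=0.5):
    """Rank OCR'd pages by a weighted mix of both scores; return top_k indices."""
    query_tokens = tokenize(query)
    scored = []
    for index, text in enumerate(ocr_pages):
        tokens = tokenize(text)
        score = (alpha * keyword_score(tokens, keywords)
                 + (1 - alpha) * cosine_score(tokens, query_tokens))
        scored.append((score, index))
    # Highest score first; ties are broken by page order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:top_k]]


# Hypothetical OCR output, one string per scanned page.
pages = [
    "Chairman's letter to shareholders",
    "Consolidated balance sheet: total assets and total liabilities",
    "Notes on accounting policies",
]
selected = retrieve_pages(
    pages, "total assets on the balance sheet", ["assets", "balance"], top_k=1
)
print(selected)  # [1]
```

In a pipeline of this shape, only the page images at the returned indices would be passed to the compact VLM for structured extraction, which keeps the model's input short even when the source document runs to dozens of pages.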

Anthology ID: 2026.acl-industry.99
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Yunyao Li, Georg Rehm, Mei Tu
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 1419–1433
URL: https://aclanthology.org/2026.acl-industry.99/
Cite (ACL): Han Yuxuan, Yuanxing Zhang, Yushuo Wang, and Yichao Jin. 2026. A Multistage Extraction Pipeline for Long Scanned Financial Documents: An Empirical Study in Industrial KYC Workflows. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 1419–1433, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): A Multistage Extraction Pipeline for Long Scanned Financial Documents: An Empirical Study in Industrial KYC Workflows (Yuxuan et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-industry.99.pdf
