Seokhwan Jang
2024
Language, OCR, Form Independent (LOFI) pipeline for Industrial Document Information Extraction
Chang Oh Yoon
|
Wonbeen Lee
|
Seokhwan Jang
|
Kyuwon Choi
|
Minsung Jung
|
Daewoo Choi
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
This paper presents LOFI (Language, OCR, Form Independent), a pipeline for Document Information Extraction (DIE) in Low-Resource Language (LRL) business documents. LOFI pipeline solves language, Optical Character Recognition (OCR), and form dependencies through flexible model architecture, a token-level box split algorithm, and the SPADE decoder. Experiments on Korean and Japanese documents demonstrate high performance in Semantic Entity Recognition (SER) task without additional pre-training. The pipeline’s effectiveness is validated through real-world applications in insurance and tax-free declaration services, advancing DIE capabilities for diverse languages and document types in industrial settings.