Minsung Jung


2024

Language, OCR, Form Independent (LOFI) pipeline for Industrial Document Information Extraction
Chang Oh Yoon | Wonbeen Lee | Seokhwan Jang | Kyuwon Choi | Minsung Jung | Daewoo Choi
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

This paper presents LOFI (Language, OCR, Form Independent), a pipeline for Document Information Extraction (DIE) in Low-Resource Language (LRL) business documents. The LOFI pipeline addresses language, Optical Character Recognition (OCR), and form dependencies through a flexible model architecture, a token-level box split algorithm, and the SPADE decoder. Experiments on Korean and Japanese documents demonstrate high performance on the Semantic Entity Recognition (SER) task without additional pre-training. The pipeline’s effectiveness is validated through real-world applications in insurance and tax-free declaration services, advancing DIE capabilities for diverse languages and document types in industrial settings.