Key Information Extraction in Purchase Documents using Deep Learning and Rule-based Corrections

Roberto Arroyo; Javier Yebes; Elena Martínez; Héctor Corrales; Javier Lorenzo

Abstract

Deep Learning (DL) has come to dominate the fields of Natural Language Processing (NLP) and Computer Vision (CV) in recent years. However, DL commonly relies on the availability of large amounts of annotated data, so alternative or complementary pattern-based techniques can help to improve results. In this paper, we address Key Information Extraction (KIE) in purchase documents using both DL and rule-based corrections. Our system first relies on Optical Character Recognition (OCR) and on text understanding based on entity tagging to identify purchase facts of interest (e.g., product codes, descriptions, quantities, or prices). These facts are then linked to the same product group, which is recognized by means of line detection and some grouping heuristics. After these DL stages have run, we contribute several rule-based correction mechanisms that improve the baseline DL predictions. Our experiments on purchase documents from public and NielsenIQ datasets demonstrate the improvements that these rule-based corrections provide over the baseline DL results.
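As a rough illustration of what a rule-based correction applied after DL predictions can look like, the Python sketch below repairs a price string and reconciles a quantity against a line total. It is not the authors' implementation: the abstract does not list the actual rules, so the function names (`correct_price`, `reconcile_quantity`), the OCR confusion table, and the tolerance are all hypothetical, and the input is assumed to be only the text span that the tagger labelled as a price.

```python
import re
from typing import Optional

# Hypothetical OCR confusion table (an assumption, not taken from the paper):
# letters that are often misread in place of digits inside numeric fields.
OCR_DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})


def correct_price(raw: str) -> Optional[float]:
    """Apply simple pattern-based fixes to a predicted price string.

    Assumes `raw` is only the span tagged as a price, possibly with a
    currency symbol, and not a full text line.
    """
    text = raw.strip().translate(OCR_DIGIT_FIXES)
    text = text.replace(",", ".")  # treat a comma as the decimal separator
    match = re.search(r"\d+(?:\.\d{1,2})?", text)
    return float(match.group()) if match else None


def reconcile_quantity(quantity: float, unit_price: float, total: float,
                       tol: float = 0.01) -> float:
    """Keep the predicted quantity if it is consistent with the line total;
    otherwise re-derive it from total / unit_price when that is a whole number."""
    if abs(quantity * unit_price - total) <= tol:
        return quantity
    if unit_price > 0:
        candidate = total / unit_price
        if abs(candidate - round(candidate)) <= tol:
            return float(round(candidate))
    return quantity  # no confident correction available


print(correct_price("1O,5O"))
print(reconcile_quantity(7, 2.50, 2.50))
```

The design choice here is that each rule only overrides a prediction when an independent signal (a numeric pattern or an arithmetic relation between fields) supports the change; otherwise the DL output is left untouched.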

Anthology ID: 2022.pandl-1.2
Volume: Proceedings of the First Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning
Month: October
Year: 2022
Address: Gyeongju, Republic of Korea
Editors: Laura Chiticariu, Yoav Goldberg, Gus Hahn-Powell, Clayton T. Morrison, Aakanksha Naik, Rebecca Sharp, Mihai Surdeanu, Marco Valenzuela-Escárcega, Enrique Noriega-Atala
Venue: PANDL
Publisher: International Conference on Computational Linguistics
Pages: 11–20
URL: https://aclanthology.org/2022.pandl-1.2
Cite (ACL): Roberto Arroyo, Javier Yebes, Elena Martínez, Héctor Corrales, and Javier Lorenzo. 2022. Key Information Extraction in Purchase Documents using Deep Learning and Rule-based Corrections. In Proceedings of the First Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning, pages 11–20, Gyeongju, Republic of Korea. International Conference on Computational Linguistics.
Cite (Informal): Key Information Extraction in Purchase Documents using Deep Learning and Rule-based Corrections (Arroyo et al., PANDL 2022)
PDF: https://aclanthology.org/2022.pandl-1.2.pdf
Data: CORD, FUNSD, SROIE
