Extracting structured data from invoices

Xavier Holt, Andrew Chisholm


Abstract
Business documents encode a wealth of information in a format tailored to human consumption – i.e. aesthetically dispersed natural language text, graphics and tables. We address the task of extracting key fields (e.g. the amount due on an invoice) from a wide variety of potentially unseen document formats. In contrast to traditional template-driven extraction systems, we introduce a content-driven machine-learning approach which is both robust to noise and generalises to unseen document formats. In a comparison of our approach with alternative invoice extraction systems, we observe an absolute accuracy gain of 20% across compared fields, and a 25%–94% reduction in extraction latency.
Anthology ID:
U18-1006
Volume:
Proceedings of the Australasian Language Technology Association Workshop 2018
Month:
December
Year:
2018
Address:
Dunedin, New Zealand
Editors:
Sunghwan Mac Kim, Xiuzhen (Jenny) Zhang
Venue:
ALTA
Pages:
53–59
URL:
https://aclanthology.org/U18-1006/
Cite (ACL):
Xavier Holt and Andrew Chisholm. 2018. Extracting structured data from invoices. In Proceedings of the Australasian Language Technology Association Workshop 2018, pages 53–59, Dunedin, New Zealand.
Cite (Informal):
Extracting structured data from invoices (Holt & Chisholm, ALTA 2018)
PDF:
https://aclanthology.org/U18-1006.pdf