Selective Labeling: How to Radically Lower Data-Labeling Costs for Document Extraction Models

Yichao Zhou, James Bradley Wendt, Navneet Potti, Jing Xie, Sandeep Tata


Abstract
Building automatic extraction models for visually rich documents like invoices, receipts, bills, tax forms, etc. has received significant attention lately. A key bottleneck in developing extraction models for new document types is the cost of acquiring the several thousand high-quality labeled documents that are needed to train a model with acceptable accuracy. In this paper, we propose selective labeling as a solution to this problem. The key insight is to simplify the labeling task to provide “yes/no” labels for candidate extractions predicted by a model trained on partially labeled documents. We combine this with a custom active learning strategy to find the predictions that the model is most uncertain about. We show through experiments on document types drawn from 3 different domains that selective labeling can reduce the cost of acquiring labeled data by 10× with a negligible loss in accuracy.
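The selection step described in the abstract can be illustrated with a minimal sketch. It is not the paper's method: the abstract does not specify the custom active learning strategy, so this example substitutes plain uncertainty sampling (distance of a confidence score from 0.5). The `Candidate` type, its fields, and the functions `uncertainty`, `select_for_review`, and `record_answer` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    """A candidate extraction proposed by a partially trained model (hypothetical type)."""
    doc_id: str
    field: str
    text: str
    score: float  # assumed: model confidence in [0, 1] that the candidate is correct


def uncertainty(score: float) -> float:
    """Return 1.0 when the score is 0.5 (least certain) and 0.0 at scores of 0 or 1."""
    return 1.0 - 2.0 * abs(score - 0.5)


def select_for_review(candidates: List[Candidate], budget: int) -> List[Candidate]:
    """Pick the `budget` candidates the model is least certain about.

    Assumption: generic uncertainty sampling stands in for the paper's
    custom strategy, which the abstract does not detail.
    """
    ranked = sorted(candidates, key=lambda c: uncertainty(c.score), reverse=True)
    return ranked[:budget]


def record_answer(candidate: Candidate, is_correct: bool) -> Dict[str, object]:
    """Turn a reviewer's yes/no answer into a labeled training example."""
    return {
        "doc_id": candidate.doc_id,
        "field": candidate.field,
        "text": candidate.text,
        "label": 1 if is_correct else 0,
    }


if __name__ == "__main__":
    pool = [
        Candidate("inv-1", "total", "$120.00", 0.98),
        Candidate("inv-1", "due_date", "2023-01-05", 0.52),
        Candidate("inv-2", "total", "$17.40", 0.10),
        Candidate("inv-2", "due_date", "Net 30", 0.45),
    ]
    # Only the two least certain candidates are sent to a human for a yes/no check.
    for cand in select_for_review(pool, budget=2):
        print(cand.field, cand.text, round(uncertainty(cand.score), 2))
```

The point of the design is that a reviewer answers a cheap yes/no question for a small number of candidates instead of annotating every field in every document, which is the source of the cost reduction the abstract reports.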
Anthology ID: 2023.emnlp-main.233
Volume: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 3847–3860
URL: https://aclanthology.org/2023.emnlp-main.233
DOI: 10.18653/v1/2023.emnlp-main.233
Cite (ACL): Yichao Zhou, James Bradley Wendt, Navneet Potti, Jing Xie, and Sandeep Tata. 2023. Selective Labeling: How to Radically Lower Data-Labeling Costs for Document Extraction Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3847–3860, Singapore. Association for Computational Linguistics.
Cite (Informal): Selective Labeling: How to Radically Lower Data-Labeling Costs for Document Extraction Models (Zhou et al., EMNLP 2023)
PDF: https://aclanthology.org/2023.emnlp-main.233.pdf
Video: https://aclanthology.org/2023.emnlp-main.233.mp4