Field Extraction from Forms with Unlabeled Data

Mingfei Gao, Zeyuan Chen, Nikhil Naik, Kazuma Hashimoto, Caiming Xiong, Ran Xu


Abstract
We propose a novel framework to conduct field extraction from forms with unlabeled data. To bootstrap the training process, we develop a rule-based method for mining noisy pseudo-labels from unlabeled forms. Using the supervisory signal from the pseudo-labels, we extract a discriminative token representation from a transformer-based model by modeling the interaction between text in the form. To prevent the model from overfitting to label noise, we introduce a refinement module based on a progressive pseudo-label ensemble. Experimental results demonstrate the effectiveness of our framework.
Anthology ID:
2022.spanlp-1.4
Volume:
Proceedings of the 1st Workshop on Semiparametric Methods in NLP: Decoupling Logic from Knowledge
Month:
May
Year:
2022
Address:
Dublin, Ireland and Online
Editors:
Rajarshi Das, Patrick Lewis, Sewon Min, June Thai, Manzil Zaheer
Venue:
SpaNLP
Publisher:
Association for Computational Linguistics
Pages:
30–40
URL:
https://aclanthology.org/2022.spanlp-1.4
DOI:
10.18653/v1/2022.spanlp-1.4
Cite (ACL):
Mingfei Gao, Zeyuan Chen, Nikhil Naik, Kazuma Hashimoto, Caiming Xiong, and Ran Xu. 2022. Field Extraction from Forms with Unlabeled Data. In Proceedings of the 1st Workshop on Semiparametric Methods in NLP: Decoupling Logic from Knowledge, pages 30–40, Dublin, Ireland and Online. Association for Computational Linguistics.
Cite (Informal):
Field Extraction from Forms with Unlabeled Data (Gao et al., SpaNLP 2022)
PDF:
https://aclanthology.org/2022.spanlp-1.4.pdf
Video:
https://aclanthology.org/2022.spanlp-1.4.mp4
Code:
salesforce/inv-cdip