Prototype-Representations for Training Data Filtering in Weakly-Supervised Information Extraction

Nasser Zalmout, Xian Li


Abstract
The availability of high-quality training data remains a bottleneck for the practical use of information extraction models, despite breakthroughs in zero- and few-shot learning techniques. The problem is exacerbated in industry applications, where new tasks, domains, and specific use cases keep arising, making it impractical to depend on manually annotated data. Weak and distant supervision have therefore emerged as popular approaches to bootstrap training, using labeling functions to guide the annotation process. Weakly supervised annotation of training data is fast and efficient; however, it results in many irrelevant and out-of-context matches. These noisy matches can degrade the performance of downstream models or require a manual data cleaning step that incurs significant overhead. In this paper, we present a prototype-based filtering approach that can be used to denoise weakly supervised training data. The system is simple, unsupervised, and scalable, and requires little manual intervention, yet yields significant precision gains. We apply the technique to the task of attribute value extraction on e-commerce websites and achieve up to a 9% gain in precision for the downstream models, with a minimal drop in recall.
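The abstract does not spell out how prototypes are built or compared, so the following is only a minimal sketch of one common way a prototype-based filter could work, not the authors' implementation. It assumes each weakly labeled match already has an embedding vector, takes the mean embedding per label as the prototype, and drops matches whose cosine similarity to their label's prototype falls below a threshold. The function names, the toy vectors, and the 0.5 threshold are all hypothetical.

```python
import math

def mean_vector(vectors):
    # Element-wise mean of equal-length vectors: the assumed "prototype".
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def filter_by_prototype(examples, threshold=0.5):
    # examples: list of (label, embedding) pairs from weak supervision.
    # Group embeddings by label and build one prototype per label.
    by_label = {}
    for label, emb in examples:
        by_label.setdefault(label, []).append(emb)
    prototypes = {label: mean_vector(embs) for label, embs in by_label.items()}
    # Keep only examples that are close enough to their label's prototype.
    return [
        (label, emb)
        for label, emb in examples
        if cosine(emb, prototypes[label]) >= threshold
    ]

# Toy data (hypothetical): three in-context matches and one outlier.
toy = [
    ("color", [1.0, 0.1]),
    ("color", [0.9, 0.2]),
    ("color", [0.8, 0.0]),
    ("color", [-0.2, 1.0]),  # stands in for an out-of-context match
]
kept = filter_by_prototype(toy, threshold=0.5)
```

In this toy run the outlier is the only example filtered out. A filter of this kind needs no labels beyond the weak ones, which fits the abstract's description of an unsupervised step; the threshold controls the trade-off between precision and recall.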
Anthology ID:
2022.emnlp-industry.47
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month:
December
Year:
2022
Address:
Abu Dhabi, UAE
Editors:
Yunyao Li, Angeliki Lazaridou
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
467–474
URL:
https://aclanthology.org/2022.emnlp-industry.47
DOI:
10.18653/v1/2022.emnlp-industry.47
Cite (ACL):
Nasser Zalmout and Xian Li. 2022. Prototype-Representations for Training Data Filtering in Weakly-Supervised Information Extraction. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 467–474, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Prototype-Representations for Training Data Filtering in Weakly-Supervised Information Extraction (Zalmout & Li, EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-industry.47.pdf