Weak Supervision using Linguistic Knowledge for Information Extraction

Sachin Pawar, Girish Palshikar, Ankita Jain, Jyoti Bhat, Simi Johnson


Abstract
In this paper, we propose to use linguistic knowledge to automatically augment a small manually annotated corpus to obtain a large annotated corpus for training Information Extraction models. We propose a powerful patterns specification language for specifying linguistic rules for entity extraction. We define an Enriched Text Format (ETF) to represent rich linguistic information about a text in the form of XML-like tags. The patterns in our patterns specification language are then matched on the ETF text rather than raw text to extract various entity mentions. We demonstrate how an entity extraction system can be quickly built for a domain-specific entity type for which there are no readily available annotated datasets.
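The idea of matching patterns on tagged text rather than raw text can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `<w pos="...">` tag layout, the `enrich` and `extract_mentions` helpers, and the single rule (a maximal run of adjacent proper nouns is a candidate mention) are hypothetical and are not the ETF or the pattern specification language defined in the paper.

```python
import re

# Hypothetical enriched text: each token is wrapped in an XML-like tag that
# carries its part-of-speech label. This layout is an assumption for
# illustration only, not the ETF defined in the paper.
def enrich(tagged_tokens):
    return " ".join(f'<w pos="{pos}">{word}</w>' for word, pos in tagged_tokens)

# Hypothetical rule: a maximal run of adjacent proper nouns (NNP) is treated
# as one candidate entity mention. The rule is written against the tags,
# so it never has to guess part of speech from the raw words.
NNP_TOKEN = r'<w pos="NNP">[^<]+</w>'
PROPER_NOUN_RUN = re.compile(rf'{NNP_TOKEN}(?: {NNP_TOKEN})*')
WORD_IN_TAG = re.compile(r'<w pos="[^"]*">([^<]+)</w>')

def extract_mentions(enriched_text):
    mentions = []
    for match in PROPER_NOUN_RUN.finditer(enriched_text):
        # Strip the tags again to recover the surface form of the mention.
        words = WORD_IN_TAG.findall(match.group(0))
        mentions.append(" ".join(words))
    return mentions

# Toy input with hand-assigned part-of-speech tags (made up for this sketch).
tokens = [("Asha", "NNP"), ("works", "VBZ"), ("at", "IN"),
          ("Acme", "NNP"), ("Labs", "NNP"), (".", ".")]
mentions = extract_mentions(enrich(tokens))
print(mentions)
```

In a weak supervision setting, mentions extracted this way could serve as automatically generated labels that supplement a small manually annotated corpus. A realistic rule set would need richer annotations than part-of-speech tags alone.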
Anthology ID:
2020.icon-main.50
Volume:
Proceedings of the 17th International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2020
Address:
Indian Institute of Technology Patna, Patna, India
Editors:
Pushpak Bhattacharyya, Dipti Misra Sharma, Rajeev Sangal
Venue:
ICON
Publisher:
NLP Association of India (NLPAI)
Pages:
368–372
URL:
https://aclanthology.org/2020.icon-main.50
Cite (ACL):
Sachin Pawar, Girish Palshikar, Ankita Jain, Jyoti Bhat, and Simi Johnson. 2020. Weak Supervision using Linguistic Knowledge for Information Extraction. In Proceedings of the 17th International Conference on Natural Language Processing (ICON), pages 368–372, Indian Institute of Technology Patna, Patna, India. NLP Association of India (NLPAI).
Cite (Informal):
Weak Supervision using Linguistic Knowledge for Information Extraction (Pawar et al., ICON 2020)
PDF:
https://aclanthology.org/2020.icon-main.50.pdf