Wonhee Lee


2022

ANTS: A Framework for Retrieval of Text Segments in Unstructured Documents
Brian Chivers | Mason P. Jiang | Wonhee Lee | Amy Ng | Natalya I. Rapstine | Alex Storer
Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing

Text segmentation and extraction from unstructured documents can provide business researchers with a wealth of new information on firms and their behaviors. However, the most valuable text is often difficult to extract consistently due to substantial variations in how content can appear from document to document. Thus, the most successful way to extract this content has been through costly crowdsourcing and training of manual workers. We propose the Assisted Neural Text Segmentation (ANTS) framework to identify pertinent text in unstructured documents from a small set of labeled examples. ANTS leverages deep learning and transfer learning architectures to empower researchers to identify relevant text with minimal manual coding. Using a real-world sample of accounting documents, we identify targeted sections 96% of the time using only 5 training examples.