ANTS: A Framework for Retrieval of Text Segments in Unstructured Documents

Brian Chivers; Mason P. Jiang; Wonhee Lee; Amy Ng; Natalya I. Rapstine; Alex Storer

doi:10.18653/v1/2022.deeplo-1.5

Abstract

Text segmentation and extraction from unstructured documents can provide business researchers with a wealth of new information on firms and their behaviors. However, the most valuable text is often difficult to extract consistently because its appearance varies substantially from document to document. As a result, the most successful way to extract this content has been costly crowdsourcing and the training of manual workers. We propose the Assisted Neural Text Segmentation (ANTS) framework to identify pertinent text in unstructured documents from a small set of labeled examples. ANTS leverages deep learning and transfer learning architectures to empower researchers to identify relevant text with minimal manual coding. Using a real-world sample of accounting documents, we identify targeted sections 96% of the time using only 5 training examples.

Anthology ID: 2022.deeplo-1.5
Volume: Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing
Month: July
Year: 2022
Address: Hybrid
Editors: Colin Cherry, Angela Fan, George Foster, Gholamreza (Reza) Haffari, Shahram Khadivi, Nanyun (Violet) Peng, Xiang Ren, Ehsan Shareghi, Swabha Swayamdipta
Venue: DeepLo
Publisher: Association for Computational Linguistics
Pages: 38–47
URL: https://aclanthology.org/2022.deeplo-1.5/
DOI: 10.18653/v1/2022.deeplo-1.5
Cite (ACL): Brian Chivers, Mason P. Jiang, Wonhee Lee, Amy Ng, Natalya I. Rapstine, and Alex Storer. 2022. ANTS: A Framework for Retrieval of Text Segments in Unstructured Documents. In Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, pages 38–47, Hybrid. Association for Computational Linguistics.
Cite (Informal): ANTS: A Framework for Retrieval of Text Segments in Unstructured Documents (Chivers et al., DeepLo 2022)
PDF: https://aclanthology.org/2022.deeplo-1.5.pdf
Video: https://aclanthology.org/2022.deeplo-1.5.mp4
