Megha Mishra
2021
TopGuNN: Fast NLP Training Data Augmentation using Large Corpora
Rebecca Iglesias-Flores | Megha Mishra | Ajay Patel | Akanksha Malhotra | Reno Kriz | Martha Palmer | Chris Callison-Burch
Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances
Acquiring training data for natural language processing systems can be expensive and time-consuming. Given a few training examples crafted by experts, large corpora can be mined for thousands of semantically similar examples that provide useful variability to improve model generalization. We present TopGuNN, a fast contextualized k-NN retrieval system that can efficiently index and search over contextual embeddings generated from large corpora. TopGuNN is demonstrated for a training data augmentation use case over the Gigaword corpus. Using approximate k-NN and an efficient architecture, TopGuNN performs queries over an embedding space of 4.63TB (approximately 1.5B embeddings) in less than a day.
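As a rough illustration of the retrieval idea in the abstract, the sketch below ranks a handful of indexed sentences by cosine similarity to a query embedding and keeps the top k. It is not TopGuNN's code: it uses exact brute-force search in place of the approximate k-NN the abstract mentions, and the function names (`cosine`, `top_k`), the 3-dimensional vectors, and the sentence labels are all invented for the example.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def top_k(query, index, k):
    """Return the k (similarity, sentence) pairs closest to `query`.

    `index` is a list of (embedding, sentence) pairs. This is an exact,
    brute-force scan; a system working at corpus scale would swap in an
    approximate nearest-neighbour index instead.
    """
    scored = ((cosine(query, emb), sent) for emb, sent in index)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy "contextual embeddings" (hypothetical values, 3 dimensions only).
toy_index = [
    ([1.0, 0.0, 0.0], "sentence A"),
    ([0.9, 0.1, 0.0], "sentence B"),
    ([0.0, 1.0, 0.0], "sentence C"),
    ([0.0, 0.0, 1.0], "sentence D"),
]

# A query embedding standing in for an expert-written seed example.
results = top_k([1.0, 0.05, 0.0], toy_index, k=2)
for similarity, sentence in results:
    print(f"{similarity:.3f}  {sentence}")
```

In a data augmentation setting, the retrieved sentences would be candidate training examples that resemble the seed example. The brute-force scan costs time linear in the number of indexed embeddings per query, which is why an approximate index matters at the scale the abstract describes.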
Co-authors
- Chris Callison-Burch 1
- Rebecca Iglesias-Flores 1
- Reno Kriz 1
- Akanksha Malhotra 1
- Martha Palmer 1
Venues
- dash 1