Cartography Active Learning

Mike Zhang, Barbara Plank


Abstract
We propose Cartography Active Learning (CAL), a novel Active Learning (AL) algorithm that uses the model's behavior on individual instances during training as a proxy for finding the most informative instances to label. CAL is inspired by data maps, which were recently proposed to derive insights into dataset quality (Swayamdipta et al., 2020). We compare our method on popular text classification tasks against commonly used AL strategies, which instead rely on post-training behavior. We demonstrate that CAL is competitive with other common AL methods, showing that training dynamics derived from small seed data can be successfully used for AL. We provide insights into our new AL method by analyzing batch-level statistics using the data maps. Our results further show that CAL is a more data-efficient learning strategy, achieving comparable or better results with considerably less training data.
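The training dynamics that data maps are built on can be illustrated with a minimal sketch. It computes the three per-instance statistics defined by Swayamdipta et al. (2020): confidence (mean gold-label probability across epochs), variability (its standard deviation), and correctness (fraction of epochs predicted correctly). The function name `data_map_stats`, the instance names, and all probability values are hypothetical, and the sketch does not reproduce CAL's actual selection procedure, which the abstract does not spell out.

```python
from statistics import mean, pstdev


def data_map_stats(gold_probs, correct_flags):
    """Return (confidence, variability, correctness) for one instance.

    gold_probs:    probability assigned to the gold label after each epoch
    correct_flags: whether the model's prediction was correct at each epoch
    """
    confidence = mean(gold_probs)        # mean gold-label probability
    variability = pstdev(gold_probs)     # population std. dev. across epochs
    correctness = sum(correct_flags) / len(correct_flags)
    return confidence, variability, correctness


# Hypothetical training dynamics for two instances over three epochs.
dynamics = {
    "easy": ([0.90, 0.95, 0.97], [True, True, True]),
    "ambiguous": ([0.20, 0.80, 0.50], [False, True, False]),
}

stats = {name: data_map_stats(p, c) for name, (p, c) in dynamics.items()}

# Instances with high variability fall in the "ambiguous" region of a data map.
most_variable = max(stats, key=lambda name: stats[name][1])
print(most_variable, stats[most_variable])
```

In this toy setting, the instance whose gold-label probability fluctuates most across epochs is flagged as the most variable one, which is the kind of signal a training-dynamics-based AL strategy could draw on.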
Anthology ID:
2021.findings-emnlp.36
Original:
2021.findings-emnlp.36v1
Version 2:
2021.findings-emnlp.36v2
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2021
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Venue:
Findings
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
395–406
URL:
https://aclanthology.org/2021.findings-emnlp.36
DOI:
10.18653/v1/2021.findings-emnlp.36
Cite (ACL):
Mike Zhang and Barbara Plank. 2021. Cartography Active Learning. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 395–406, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Cartography Active Learning (Zhang & Plank, Findings 2021)
PDF:
https://aclanthology.org/2021.findings-emnlp.36.pdf
Software:
 2021.findings-emnlp.36.Software.zip
Video:
 https://aclanthology.org/2021.findings-emnlp.36.mp4
Code
 Kaleidophon/deep-significance
Data
AG News