AnchorAL: Computationally Efficient Active Learning for Large and Imbalanced Datasets

Pietro Lesci, Andreas Vlachos


Abstract
Active learning for imbalanced classification tasks is challenging as the minority classes naturally occur rarely. Gathering a large pool of unlabelled data is thus essential to capture minority instances. Standard pool-based active learning is computationally expensive on large pools and often reaches low accuracy by overfitting the initial decision boundary, thus failing to explore the input space and find minority instances. To address these issues we propose AnchorAL. At each iteration, AnchorAL chooses class-specific instances from the labelled set, or *anchors*, and retrieves the most similar unlabelled instances from the pool. The resulting *subpool* is then used for active learning. By using a small, fixed-size subpool, AnchorAL allows scaling any active learning strategy to large pools. By dynamically selecting different anchors at each iteration, it promotes class balance and prevents overfitting the initial decision boundary, thus promoting the discovery of new clusters of minority instances. Experiments across different classification tasks, active learning strategies, and model architectures show that AnchorAL *(i)* is faster, often reducing runtime from hours to minutes, *(ii)* trains more performant models, and *(iii)* returns more balanced datasets than competing methods.
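The anchor-based subpool construction described above can be sketched as follows. This is an illustrative reimplementation, not the authors' released code: it assumes instances are represented as embedding vectors, picks the top-`k` cosine-similar pool instances per anchor, and returns the union of their indices as the subpool. The function name `anchoral_subpool` and the parameter `k_per_anchor` are hypothetical.

```python
import numpy as np

def anchoral_subpool(anchor_embs, pool_embs, k_per_anchor=5):
    """Sketch of AnchorAL's retrieval step (assumed, not official code):
    for each labelled anchor, retrieve the k most cosine-similar
    unlabelled pool instances; the union of indices is the subpool."""
    # Row-normalise so that dot products equal cosine similarity
    a = anchor_embs / np.linalg.norm(anchor_embs, axis=1, keepdims=True)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = a @ p.T  # shape: (n_anchors, n_pool)
    # Top-k pool indices per anchor (order within the top-k is irrelevant)
    topk = np.argpartition(-sims, k_per_anchor - 1, axis=1)[:, :k_per_anchor]
    # Union across anchors gives a small, bounded subpool of indices
    return np.unique(topk)

# Toy usage: 3 anchors, a 100-instance pool, 8-dimensional embeddings
rng = np.random.default_rng(0)
anchors = rng.normal(size=(3, 8))
pool = rng.normal(size=(100, 8))
subpool = anchoral_subpool(anchors, pool, k_per_anchor=5)
```

Any off-the-shelf acquisition strategy (e.g. uncertainty sampling) would then score only `pool[subpool]` instead of the full pool, which is what keeps each iteration cheap on large pools.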
Anthology ID:
2024.naacl-long.467
Volume:
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
8438–8457
URL:
https://aclanthology.org/2024.naacl-long.467
Cite (ACL):
Pietro Lesci and Andreas Vlachos. 2024. AnchorAL: Computationally Efficient Active Learning for Large and Imbalanced Datasets. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8438–8457, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
AnchorAL: Computationally Efficient Active Learning for Large and Imbalanced Datasets (Lesci & Vlachos, NAACL 2024)
PDF:
https://aclanthology.org/2024.naacl-long.467.pdf
Copyright:
2024.naacl-long.467.copyright.pdf