Minimizing Annotation Effort via Max-Volume Spectral Sampling

Ariadna Quattoni, Xavier Carreras


Abstract
We address the annotation data bottleneck for sequence classification. Specifically, we ask: given a budget of N annotations, which samples should we select for annotation? The solution we propose looks for diversity in the selected sample by maximizing the amount of information that is useful for the learning algorithm or, equivalently, by minimizing the redundancy of samples in the selection. This is formulated in the context of spectral learning of recurrent functions for sequence classification. Our method represents unlabeled data in the form of a Hankel matrix and uses the notion of spectral max-volume to find a compact sub-block from which annotation samples are drawn. Experiments on sequence classification confirm that our spectral sampling strategy is efficient and yields good models.
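The idea of a Hankel matrix plus a max-volume sub-block can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a count-based Hankel matrix (entry for prefix p and suffix s is the number of unlabeled sequences equal to p followed by s) and swaps in a simple greedy row selection, equivalent to pivoted Gram-Schmidt, as a stand-in for max-volume search. The function names `hankel_matrix` and `greedy_max_volume_rows` are hypothetical, and how annotation samples are then drawn from the selected sub-block is not covered here.

```python
from collections import Counter

import numpy as np


def hankel_matrix(sequences):
    """Build a count-based Hankel matrix from unlabeled sequences.

    Assumption for illustration: H[p, s] = number of sequences equal to p + s.
    Rows are indexed by observed prefixes, columns by observed suffixes.
    """
    counts = Counter(tuple(seq) for seq in sequences)
    prefixes, suffixes = set(), set()
    for seq in counts:
        for i in range(len(seq) + 1):
            prefixes.add(seq[:i])
            suffixes.add(seq[i:])
    prefixes, suffixes = sorted(prefixes), sorted(suffixes)
    H = np.zeros((len(prefixes), len(suffixes)))
    for r, p in enumerate(prefixes):
        for c, s in enumerate(suffixes):
            H[r, c] = counts[p + s]  # Counter returns 0 for unseen strings
    return H, prefixes, suffixes


def greedy_max_volume_rows(H, k, tol=1e-12):
    """Greedily pick up to k rows spanning a large volume.

    Each step takes the row with the largest residual norm, then projects
    that direction out of all rows, so redundant rows are never selected.
    This is a heuristic stand-in, not an exact max-volume solver.
    """
    R = np.asarray(H, dtype=float).copy()
    chosen = []
    for _ in range(k):
        norms = np.linalg.norm(R, axis=1)
        i = int(np.argmax(norms))
        if norms[i] < tol:  # remaining rows are linearly dependent
            break
        chosen.append(i)
        q = R[i] / norms[i]
        R -= np.outer(R @ q, q)
    return chosen


if __name__ == "__main__":
    data = [("a", "b"), ("a", "b"), ("b",)]
    H, prefixes, suffixes = hankel_matrix(data)
    rows = greedy_max_volume_rows(H, k=2)
    print([prefixes[i] for i in rows])
```

Because each selected direction is projected out before the next pick, a row that duplicates an earlier choice has zero residual and is skipped, which mirrors the abstract's goal of minimizing redundancy in the selection.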
Anthology ID:
2021.findings-emnlp.246
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2021
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Venues:
EMNLP | Findings
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
2890–2899
URL:
https://aclanthology.org/2021.findings-emnlp.246
DOI:
10.18653/v1/2021.findings-emnlp.246
Cite (ACL):
Ariadna Quattoni and Xavier Carreras. 2021. Minimizing Annotation Effort via Max-Volume Spectral Sampling. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2890–2899, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Minimizing Annotation Effort via Max-Volume Spectral Sampling (Quattoni & Carreras, Findings 2021)
PDF:
https://aclanthology.org/2021.findings-emnlp.246.pdf
Data:
AG News, IMDb Movie Reviews