Investigating Active Learning Sampling Strategies for Extreme Multi Label Text Classification

Lukas Wertz; Katsiaryna Mirylenka; Jonas Kuhn; Jasmina Bogojeska

Abstract

Large-scale, multi-label text datasets with a high number of classes are expensive to annotate, even more so if they involve domain-specific language. In this work, we aim to build classifiers on these datasets using Active Learning in order to reduce the labeling effort. We outline the challenges of extreme multi-label settings and show the limitations of existing Active Learning strategies, focusing on their effectiveness as well as their efficiency in terms of computational cost. In addition, we present five multi-label datasets compiled from hierarchical classification tasks to serve as benchmarks for future experiments in extreme multi-label classification. Finally, we provide insight into multi-class, multi-label evaluation and present an improved classifier architecture on top of pre-trained transformer language models.

Anthology ID: 2022.lrec-1.490
Volume: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month: June
Year: 2022
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 4597–4605
URL: https://aclanthology.org/2022.lrec-1.490/
Cite (ACL): Lukas Wertz, Katsiaryna Mirylenka, Jonas Kuhn, and Jasmina Bogojeska. 2022. Investigating Active Learning Sampling Strategies for Extreme Multi Label Text Classification. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4597–4605, Marseille, France. European Language Resources Association.
Cite (Informal): Investigating Active Learning Sampling Strategies for Extreme Multi Label Text Classification (Wertz et al., LREC 2022)
PDF: https://aclanthology.org/2022.lrec-1.490.pdf
Data: New York Times Annotated Corpus, RCV1
