Leveraging the Structure of Pre-trained Embeddings to Minimize Annotation Effort

Cesar Gonzalez-Gutierrez, Ariadna Quattoni


Abstract
Most current state-of-the-art approaches to text classification are based on fine-tuning representations computed by large language models (LLMs). This strategy has led to significant improvements in classification performance and has reduced the amount of labeled data required to train a model. However, for some challenging classification tasks, providing enough annotations to ensure reliable classification remains the main bottleneck, especially in settings with highly imbalanced class distributions. This paper tackles this bottleneck by exploiting the structural properties of pre-trained embeddings. We develop a label propagation method that uses pre-trained embeddings to spread information from labeled samples to nearby samples in the induced space, making optimal use of the annotations. Our approach is simple and relatively low-cost, since it only requires computing distances in the embedding space. Experiments on several text classification datasets show that the proposed method is efficient and significantly outperforms both self-training and random-walk label propagation strategies.
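
The abstract summarizes the method only at a high level. As a rough illustration, and not the authors' exact algorithm, the sketch below shows the general idea in Python: spread labels from annotated samples to their nearest unlabeled neighbors in a pre-trained embedding space via a majority vote. The function name propagate_labels, the use of Euclidean distance, and the neighborhood size k are assumptions chosen purely for illustration.

import numpy as np

def propagate_labels(embeddings, labels, k=5):
    """Spread labels from annotated points (labels != -1) to unlabeled
    points (labels == -1) using nearest neighbors in embedding space.
    Illustrative sketch only, not the method from the paper."""
    labels = labels.copy()
    labeled = np.flatnonzero(labels != -1)
    unlabeled = np.flatnonzero(labels == -1)
    # Euclidean distance from each unlabeled point to each labeled point.
    dists = np.linalg.norm(
        embeddings[unlabeled, None, :] - embeddings[None, labeled, :], axis=-1
    )
    for row, i in enumerate(unlabeled):
        # Majority vote among the k closest labeled neighbors.
        nearest = labeled[np.argsort(dists[row])[:k]]
        values, counts = np.unique(labels[nearest], return_counts=True)
        labels[i] = values[np.argmax(counts)]
    return labels

# Toy usage: four 2-d "embeddings", two of them labeled.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, -1, 1, -1])
print(propagate_labels(X, y, k=1))  # -> [0 0 1 1]

In practice the embeddings would come from a pre-trained encoder and the propagated labels would then be used to train an ordinary classifier; the paper evaluates this kind of strategy against self-training and random-walk label propagation baselines.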
Anthology ID:
2024.naacl-long.387
Volume:
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
6996–7010
URL:
https://aclanthology.org/2024.naacl-long.387
Cite (ACL):
Cesar Gonzalez-Gutierrez and Ariadna Quattoni. 2024. Leveraging the Structure of Pre-trained Embeddings to Minimize Annotation Effort. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6996–7010, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Leveraging the Structure of Pre-trained Embeddings to Minimize Annotation Effort (Gonzalez-Gutierrez & Quattoni, NAACL 2024)
PDF:
https://aclanthology.org/2024.naacl-long.387.pdf
Copyright:
2024.naacl-long.387.copyright.pdf