Neural Networks Against (and For) Self-Training: Classification with Small Labeled and Large Unlabeled Sets

Payam Karisani


Abstract
We propose a semi-supervised text classifier based on self-training that exploits one positive and one negative property of neural networks. One weakness of self-training is the semantic drift problem, in which noisy pseudo-labels accumulate over iterations and the error rate consequently soars. To tackle this challenge, we reshape the role of pseudo-labels and create a hierarchical order of information. In addition, a crucial step in self-training is to use the classifier's confidence predictions to select the best candidate pseudo-labels. This step cannot be done reliably with neural networks, because their output is known to be poorly calibrated. To overcome this challenge, we propose a hybrid metric to replace the plain confidence measurement. Our metric takes prediction uncertainty into account via a subsampling technique. We evaluate our model on five standard benchmarks and show that it significantly outperforms ten diverse baseline models. Furthermore, we show that the improvement achieved by our model is additive to language model pretraining, a widely used technique for exploiting unlabeled documents.
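To make the selection step concrete, below is a minimal Python sketch of a self-training loop with an uncertainty-aware selection score. It is an illustration under stated assumptions, not the paper's implementation: a scikit-learn logistic regression stands in for the neural classifier, binary 0/1 labels are assumed, and the particular combination of mean confidence and subsample variance (along with hypothetical names such as uncertainty_scores, self_train, and var_weight) is only one plausible reading of the hybrid metric described in the abstract.

import numpy as np
from sklearn.linear_model import LogisticRegression


def uncertainty_scores(X_lab, y_lab, X_unlab, n_subsamples=5, frac=0.8, seed=0):
    # Train several classifiers on random subsamples of the labeled data and
    # return, for each unlabeled example, the mean positive-class probability
    # and the variance of that probability across the subsample models.
    # Assumes the full labeled set contains both classes.
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_subsamples):
        size = max(2, int(frac * len(X_lab)))
        idx = rng.choice(len(X_lab), size=size, replace=False)
        # Resample if the subsample happens to contain a single class.
        while len(np.unique(y_lab[idx])) < 2:
            idx = rng.choice(len(X_lab), size=size, replace=False)
        clf = LogisticRegression(max_iter=1000).fit(X_lab[idx], y_lab[idx])
        probs.append(clf.predict_proba(X_unlab)[:, 1])
    probs = np.stack(probs)  # shape: (n_subsamples, n_unlabeled)
    return probs.mean(axis=0), probs.var(axis=0)


def self_train(X_lab, y_lab, X_unlab, rounds=5, per_round=20, var_weight=1.0):
    # Iteratively move the highest-scoring unlabeled examples into the labeled
    # set, preferring predictions that are both confident (far from 0.5) and
    # stable (low variance across the subsample models).
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(rounds):
        if len(X_unlab) == 0:
            break
        mean_p, var_p = uncertainty_scores(X_lab, y_lab, X_unlab)
        score = np.abs(mean_p - 0.5) - var_weight * var_p  # hybrid selection score
        top = np.argsort(score)[-per_round:]
        pseudo_y = (mean_p[top] >= 0.5).astype(int)
        X_lab = np.vstack([X_lab, X_unlab[top]])
        y_lab = np.concatenate([y_lab, pseudo_y])
        X_unlab = np.delete(X_unlab, top, axis=0)
    # Final classifier trained on the union of labeled and pseudo-labeled data.
    return LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

A typical call would be self_train(X_lab, y_lab, X_unlab) on numeric feature matrices; in the paper's setting these would instead be document representations produced by a pretrained neural encoder.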
Anthology ID: 2023.findings-acl.769
Volume: Findings of the Association for Computational Linguistics: ACL 2023
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 12148–12162
URL: https://aclanthology.org/2023.findings-acl.769
DOI: 10.18653/v1/2023.findings-acl.769
Cite (ACL): Payam Karisani. 2023. Neural Networks Against (and For) Self-Training: Classification with Small Labeled and Large Unlabeled Sets. In Findings of the Association for Computational Linguistics: ACL 2023, pages 12148–12162, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): Neural Networks Against (and For) Self-Training: Classification with Small Labeled and Large Unlabeled Sets (Karisani, Findings 2023)
PDF: https://aclanthology.org/2023.findings-acl.769.pdf