Seed Word Selection for Weakly-Supervised Text Classification with Unsupervised Error Estimation

Yiping Jin, Akshay Bhatia, Dittaya Wanvarie


Abstract
Weakly-supervised text classification aims to induce text classifiers from only a few user-provided seed words. The vast majority of previous work assumes that high-quality seed words are given. However, expert-annotated seed words are sometimes non-trivial to come up with. Furthermore, in the weakly-supervised learning setting, we do not have any labeled documents with which to measure the seed words’ efficacy, making the seed word selection process “a walk in the dark”. In this work, we remove the need for expert-curated seed words by first mining (noisy) candidate seed words associated with the category names. We then train interim models with individual candidate seed words. Lastly, we estimate the interim models’ error rates in an unsupervised manner. The seed words that yield the lowest estimated error rates are added to the final seed word set. A comprehensive evaluation on six binary classification tasks from four popular datasets demonstrates that the proposed method outperforms a baseline using only category name seed words and achieves performance comparable to a counterpart using expert-annotated seed words.
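The selection loop described in the abstract (score each candidate seed word with an interim model, estimate each model's error without labels, keep the lowest-error candidates) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the interim "model" is a plain keyword matcher rather than a trained classifier, and the error estimate is a simple disagreement-with-majority-vote proxy swapped in for the paper's estimator, which the abstract does not specify.

```python
from collections import Counter
from typing import Callable, Dict, List


def keyword_classifier(seed: str) -> Callable[[str], int]:
    """Hypothetical interim model: predict 1 if the seed word occurs in the document."""
    def predict(doc: str) -> int:
        return int(seed in doc.lower().split())
    return predict


def disagreement_with_majority(preds: Dict[str, List[int]]) -> Dict[str, float]:
    """Proxy error estimate (an assumption, not the paper's estimator):
    the fraction of documents on which a model disagrees with the majority vote."""
    names = list(preds)
    n_docs = len(preds[names[0]])
    majority = [
        Counter(preds[name][i] for name in names).most_common(1)[0][0]
        for i in range(n_docs)
    ]
    return {
        name: sum(p != m for p, m in zip(preds[name], majority)) / n_docs
        for name in names
    }


def select_seed_words(candidates: List[str], docs: List[str], top_k: int) -> List[str]:
    """Keep the top_k candidates whose interim models have the lowest estimated error."""
    preds = {w: [keyword_classifier(w)(d) for d in docs] for w in candidates}
    est = disagreement_with_majority(preds)
    return sorted(candidates, key=lambda w: est[w])[:top_k]


# Toy unlabeled corpus and noisy candidates for a hypothetical "sports" category.
docs = [
    "the team won the football match",
    "football match ended in a draw",
    "the team played a great match",
    "parliament passed the new law",
    "the election results are in",
]
candidates = ["football", "match", "the"]
selected = select_seed_words(candidates, docs, top_k=2)
```

In this toy run the generic word "the" disagrees with the majority vote most often and is filtered out, while "match" and "football" are kept. An odd number of candidates is used so that the binary majority vote never ties.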
Anthology ID:
2021.naacl-srw.14
Volume:
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop
Month:
June
Year:
2021
Address:
Online
Editors:
Esin Durmus, Vivek Gupta, Nelson Liu, Nanyun Peng, Yu Su
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
112–118
URL:
https://aclanthology.org/2021.naacl-srw.14
DOI:
10.18653/v1/2021.naacl-srw.14
Cite (ACL):
Yiping Jin, Akshay Bhatia, and Dittaya Wanvarie. 2021. Seed Word Selection for Weakly-Supervised Text Classification with Unsupervised Error Estimation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 112–118, Online. Association for Computational Linguistics.
Cite (Informal):
Seed Word Selection for Weakly-Supervised Text Classification with Unsupervised Error Estimation (Jin et al., NAACL 2021)
PDF:
https://aclanthology.org/2021.naacl-srw.14.pdf
Video:
https://aclanthology.org/2021.naacl-srw.14.mp4
Code:
YipingNUS/OptimSeed