A Pseudo Label based Dataless Naive Bayes Algorithm for Text Classification with Seed Words

Ximing Li, Bo Yang


Abstract
Traditional supervised text classifiers require a large number of manually labeled documents, which are often expensive to obtain. Dataless text classification has recently attracted growing attention because it requires only a few seed words per category, which are much cheaper to provide. In this paper, we develop a pseudo-label based dataless Naive Bayes (PL-DNB) classifier with seed words. We initialize pseudo-labels for each document using seed word occurrences and employ the expectation-maximization algorithm to train PL-DNB in a semi-supervised manner. The pseudo-labels are iteratively updated using a mixture of seed word occurrences and estimates of the label posteriors. To avoid noisy pseudo-labels, the pseudo-label update step also incorporates information from each document's nearest neighbors, thereby preserving the local neighborhood structure of documents. We empirically show that PL-DNB outperforms traditional dataless text classification algorithms with seed words. In particular, PL-DNB performs well on imbalanced data.
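The training loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact update rules, so the mixing weight `mix`, the neighbor count `k`, the smoothing constant `alpha`, the use of cosine similarity, and the equal 0.5/0.5 weighting between a document and its neighbors are all assumptions made for illustration. The function names are hypothetical as well.

```python
import numpy as np

def seed_pseudo_labels(X, seed_ids, eps=1e-9):
    """Initial pseudo-labels: normalized seed word counts per category."""
    S = np.zeros((X.shape[0], len(seed_ids)))
    for c, ids in enumerate(seed_ids):
        S[:, c] = X[:, ids].sum(axis=1)
    S = S + eps  # documents without any seed word get a uniform pseudo-label
    return S / S.sum(axis=1, keepdims=True)

def knn_smooth(X, Q, k):
    """Average the rows of Q over each document's k nearest neighbors (cosine)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)           # a document is not its own neighbor
    nn = np.argsort(-sim, axis=1)[:, :k]     # indices of the k most similar docs
    return Q[nn].mean(axis=1)

def pl_dnb_sketch(X, seed_ids, n_iter=20, mix=0.5, k=2, alpha=1.0):
    """EM-style training of a multinomial Naive Bayes model on pseudo-labels.

    X        : (n_docs, n_words) term-count matrix
    seed_ids : one list of seed word column indices per category
    mix, k, alpha : assumed hyperparameters, not taken from the paper
    """
    X = np.asarray(X, dtype=float)
    S = seed_pseudo_labels(X, seed_ids)
    Q = S.copy()
    for _ in range(n_iter):
        # M-step: estimate class priors and word distributions from soft labels
        prior = Q.sum(axis=0) + alpha
        prior = prior / prior.sum()
        word = Q.T @ X + alpha
        word = word / word.sum(axis=1, keepdims=True)
        # E-step: label posteriors under the current Naive Bayes parameters
        log_post = np.log(prior) + X @ np.log(word).T
        log_post -= log_post.max(axis=1, keepdims=True)
        P = np.exp(log_post)
        P /= P.sum(axis=1, keepdims=True)
        # Pseudo-label update: mix seed evidence with posteriors,
        # then smooth with the nearest neighbors' pseudo-labels
        Q = mix * S + (1.0 - mix) * P
        Q = 0.5 * Q + 0.5 * knn_smooth(X, Q, k)
        Q /= Q.sum(axis=1, keepdims=True)
    return Q
```

Calling `pl_dnb_sketch(X, seed_ids=[[0], [3]])` on a small term-count matrix returns one row of soft pseudo-labels per document, and `Q.argmax(axis=1)` gives the predicted categories. Documents that contain no seed word start from a uniform pseudo-label and are then pulled toward a category by the posterior and neighbor terms.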
Anthology ID:
C18-1162
Volume:
Proceedings of the 27th International Conference on Computational Linguistics
Month:
August
Year:
2018
Address:
Santa Fe, New Mexico, USA
Editors:
Emily M. Bender, Leon Derczynski, Pierre Isabelle
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
1908–1917
URL:
https://aclanthology.org/C18-1162
Cite (ACL):
Ximing Li and Bo Yang. 2018. A Pseudo Label based Dataless Naive Bayes Algorithm for Text Classification with Seed Words. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1908–1917, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal):
A Pseudo Label based Dataless Naive Bayes Algorithm for Text Classification with Seed Words (Li & Yang, COLING 2018)
PDF:
https://aclanthology.org/C18-1162.pdf