Bootstrap Domain-Specific Sentiment Classifiers from Unlabeled Corpora

Andrius Mudinas; Dell Zhang; Mark Levene

doi:10.1162/tacl_a_00020

Abstract

There is often the need to perform sentiment classification in a particular domain where no labeled document is available. Although we could make use of a general-purpose off-the-shelf sentiment classifier or a pre-built one for a different domain, the effectiveness would be inferior. In this paper, we explore the possibility of building domain-specific sentiment classifiers with unlabeled documents only. Our investigation indicates that in the word embeddings learned from the unlabeled corpus of a given domain, the distributed word representations (vectors) for opposite sentiments form distinct clusters, though those clusters are not transferable across domains. Exploiting such a clustering structure, we are able to utilize machine learning algorithms to induce a quality domain-specific sentiment lexicon from just a few typical sentiment words (“seeds”). An important finding is that simple linear model based supervised learning algorithms (such as linear SVM) can actually work better than more sophisticated semi-supervised/transductive learning algorithms which represent the state-of-the-art technique for sentiment lexicon induction. The induced lexicon could be applied directly in a lexicon-based method for sentiment classification, but a higher performance could be achieved through a two-phase bootstrapping method which uses the induced lexicon to assign positive/negative sentiment scores to unlabeled documents first, and then uses those documents found to have clear sentiment signals as pseudo-labeled examples to train a document sentiment classifier via supervised learning algorithms (such as LSTM). On several benchmark datasets for document sentiment classification, our end-to-end pipelined approach which is overall unsupervised (except for a tiny set of seed words) outperforms existing unsupervised approaches and achieves an accuracy comparable to that of fully supervised approaches.
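
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 2-d word vectors, seed words, thresholds, and function names below are all hypothetical, and a plain perceptron stands in for the linear SVM mentioned in the abstract. It shows only the shape of the idea: fit a linear model on seed-word vectors, score the remaining vocabulary to induce a lexicon, then keep documents with clear lexicon scores as pseudo-labeled examples.

```python
# Toy 2-d "embeddings" (made-up values). In practice these would be
# learned from the unlabeled domain corpus.
EMBEDDINGS = {
    "good": (1.0, 0.2), "great": (0.9, 0.1), "excellent": (0.8, 0.3),
    "bad": (-1.0, 0.1), "awful": (-0.9, 0.2), "poor": (-0.8, 0.3),
    "sturdy": (0.7, 0.5), "flimsy": (-0.7, 0.5), "box": (0.0, 0.9),
}
# Hypothetical seed words: +1 = positive, -1 = negative.
SEEDS = {"good": 1, "great": 1, "bad": -1, "awful": -1}


def train_linear(seeds, embeddings, epochs=20):
    """Fit a perceptron on seed-word vectors (stand-in for a linear SVM)."""
    dim = len(next(iter(embeddings.values())))
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for word, label in seeds.items():
            x = embeddings[word]
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if label * score <= 0:  # misclassified: move the boundary
                w = [wi + label * xi for wi, xi in zip(w, x)]
                b += label
    return w, b


def induce_lexicon(w, b, embeddings, min_abs=0.5):
    """Score every word; keep those far enough from the decision boundary."""
    lexicon = {}
    for word, x in embeddings.items():
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        if abs(score) >= min_abs:
            lexicon[word] = score
    return lexicon


def pseudo_label(docs, lexicon, margin=1.0):
    """Phase one: label only documents with a clear lexicon-based score."""
    labeled = []
    for doc in docs:
        score = sum(lexicon.get(tok, 0.0) for tok in doc.lower().split())
        if abs(score) >= margin:
            labeled.append((doc, 1 if score > 0 else -1))
    return labeled


w, b = train_linear(SEEDS, EMBEDDINGS)
lexicon = induce_lexicon(w, b, EMBEDDINGS)
docs = ["sturdy box great value", "flimsy and poor", "the box arrived"]
pseudo = pseudo_label(docs, lexicon)
# Phase two (not shown): train a supervised document classifier on `pseudo`;
# the abstract names LSTM as one such classifier.
```

In this toy run the neutral word "box" falls below the lexicon threshold and the third document receives no pseudo-label, which mirrors the abstract's point that only documents with clear sentiment signals are used for training.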

Anthology ID: Q18-1020
Volume: Transactions of the Association for Computational Linguistics, Volume 6
Year: 2018
Address: Cambridge, MA
Editors: Lillian Lee, Mark Johnson, Kristina Toutanova, Brian Roark
Venue: TACL
Publisher: MIT Press
Pages: 269–285
URL: https://aclanthology.org/Q18-1020/
DOI: 10.1162/tacl_a_00020
Cite (ACL): Andrius Mudinas, Dell Zhang, and Mark Levene. 2018. Bootstrap Domain-Specific Sentiment Classifiers from Unlabeled Corpora. Transactions of the Association for Computational Linguistics, 6:269–285.
Cite (Informal): Bootstrap Domain-Specific Sentiment Classifiers from Unlabeled Corpora (Mudinas et al., TACL 2018)
PDF: https://aclanthology.org/Q18-1020.pdf
