Improving Topic Models with Latent Feature Word Representations

Dat Quoc Nguyen, Richard Billingsley, Lan Du, Mark Johnson


Abstract
Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, our new models produce significant improvements on topic coherence, document clustering and document classification tasks, especially on datasets with few or short documents.
Anthology ID:
Q15-1022
Original:
Q15-1022v1
Version 2:
Q15-1022v2
Erratum e1:
Q15-1022e1
Volume:
Transactions of the Association for Computational Linguistics, Volume 3
Year:
2015
Address:
Cambridge, MA
Editors:
Michael Collins, Lillian Lee
Venue:
TACL
Publisher:
MIT Press
Pages:
299–313
URL:
https://aclanthology.org/Q15-1022/
DOI:
10.1162/tacl_a_00140
Cite (ACL):
Dat Quoc Nguyen, Richard Billingsley, Lan Du, and Mark Johnson. 2015. Improving Topic Models with Latent Feature Word Representations. Transactions of the Association for Computational Linguistics, 3:299–313.
Cite (Informal):
Improving Topic Models with Latent Feature Word Representations (Nguyen et al., TACL 2015)
PDF:
https://aclanthology.org/Q15-1022.pdf
Video:
https://aclanthology.org/Q15-1022.mp4