Comparing Apples to Apple: The Effects of Stemmers on Topic Models

Alexandra Schofield, David Mimno


Abstract
Rule-based stemmers such as the Porter stemmer are frequently used to preprocess English corpora for topic modeling. In this work, we train and evaluate topic models on a variety of corpora using several different stemming algorithms. We examine several different quantitative measures of the resulting models, including likelihood, coherence, model stability, and entropy. Despite their frequent use in topic modeling, we find that stemmers produce no meaningful improvement in likelihood and coherence and in fact can degrade topic stability.
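As a rough illustration of the preprocessing step the abstract refers to, the sketch below shows how stemming merges surface forms such as "apples" and "apple" into one vocabulary entry before a topic model sees them. The `toy_stem` function is a hypothetical, deliberately crude plural stripper written for this example. It is not the Porter stemmer or any other algorithm evaluated in the paper, and the token list is made up.

```python
from collections import Counter


def toy_stem(token: str) -> str:
    """Hypothetical stand-in for a rule-based stemmer (NOT the Porter algorithm).

    Strips a single trailing "s" from tokens longer than three characters,
    leaving words that end in "ss" untouched.
    """
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


# Made-up toy corpus, already tokenized and lowercased.
tokens = ["apples", "apple", "models", "model", "topics", "topic"]

# Vocabulary as a topic model would see it without and with stemming.
raw_counts = Counter(tokens)
stemmed_counts = Counter(toy_stem(t) for t in tokens)

print(len(raw_counts), "word types before stemming")
print(len(stemmed_counts), "word types after stemming")
```

Conflation of this kind shrinks the vocabulary, so any comparison of models trained with and without stemming has to account for the two models being defined over different sets of word types.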
Anthology ID:
Q16-1021
Volume:
Transactions of the Association for Computational Linguistics, Volume 4
Year:
2016
Address:
Cambridge, MA
Editors:
Lillian Lee, Mark Johnson, Kristina Toutanova
Venue:
TACL
Publisher:
MIT Press
Pages:
287–300
URL:
https://aclanthology.org/Q16-1021/
DOI:
10.1162/tacl_a_00099
Cite (ACL):
Alexandra Schofield and David Mimno. 2016. Comparing Apples to Apple: The Effects of Stemmers on Topic Models. Transactions of the Association for Computational Linguistics, 4:287–300.
Cite (Informal):
Comparing Apples to Apple: The Effects of Stemmers on Topic Models (Schofield & Mimno, TACL 2016)
PDF:
https://aclanthology.org/Q16-1021.pdf
Data:
New York Times Annotated Corpus