Anja Subasic
2021
FANATIC: FAst Noise-Aware TopIc Clustering
Ari Silburt
|
Anja Subasic
|
Evan Thompson
|
Carmeline Dsilva
|
Tarec Fares
Findings of the Association for Computational Linguistics: EMNLP 2021
Extracting salient topics from a collection of documents can be a challenging task when a) the amount of data is large, b) the number of topics is not known a priori, and/or c) “topic noise” is present. We define “topic noise” as the collection of documents that are irrelevant to any coherent topic and should be filtered out. By design, most clustering algorithms (e.g. k-means, hierarchical clustering) assign all input documents to one of the available clusters, guaranteeing any topic noise to propagate into the result. To address these challenges, we present a novel algorithm, FANATIC, that efficiently distinguishes documents from genuine topics and those that are topic noise. We also introduce a new Reddit dataset to showcase FANATIC as it contains short, noisy data that is difficult to cluster using most clustering algorithms. We find that FANATIC clusters 500k Reddit titles (of which 20% are topic noise) in 2 minutes and achieves an AMI score of 0.59, in contrast with hdbscan (McInnes et al., 2017), a popular algorithm suited for this type of task, which requires over 7 hours and achieves an AMI of 0.03. Finally, we test FANATIC against a Twitter dataset and find again that it outperforms the other algorithms with an AMI score of 0.60. We make our code and data publicly available.
Search