FANATIC: FAst Noise-Aware TopIc Clustering

Extracting salient topics from a collection of documents can be a challenging task when a) the amount of data is large, b) the number of topics is not known a priori, and/or c) "topic noise" is present. We define "topic noise" as the collection of documents that are irrelevant to any coherent topic and should be filtered out. By design, most clustering algorithms (e.g., k-means, hierarchical clustering) assign every input document to one of the available clusters, guaranteeing that any topic noise propagates into the result. To address these challenges, we present a novel algorithm, FANATIC, that efficiently distinguishes genuine topic documents from topic noise. We also introduce a new Reddit dataset to showcase FANATIC, as it contains short, noisy data that is difficult to cluster with most clustering algorithms. We find that FANATIC clusters 500k Reddit titles (of which 20% are topic noise) in 2 minutes and achieves an AMI score of 0.59, in contrast with hdbscan (McInnes et al., 2017), a popular algorithm suited to this type of task, which requires over 7 hours and achieves an AMI of 0.03. Finally, we test FANATIC on a Twitter dataset and find again that it outperforms the other algorithms, with an AMI score of 0.60. We make our code and data publicly available.


Introduction
Every minute, millions of social media posts such as Reddit comments, Tweets, Facebook comments, and other content are published online (Marr, 2018). A cornucopia of value resides in this online information, including product feedback, political beliefs, news, trending topics, and social interactions. However, these topics are generally needles in a haystack of topic noise and require suitable algorithms for extracting them. The ability to group short-text documents into topics is an increasingly relevant problem, yet few algorithms are effective because:
• the large number of documents can become computationally prohibitive;
• the number of topics is not known a priori;
• a large fraction of documents may be topically irrelevant or idiosyncratic, and should not be assigned to any topic. We henceforth refer to this phenomenon as "topic noise".
For social media data, clustering-based methods are often favoured over more traditional topic models (Chinnov et al., 2015) like LDA (Blei et al., 2001); however, even within the clustering domain many algorithms struggle. For example, the standard k-means algorithm requires choosing the number of clusters ahead of time or finding the optimal number (which is an NP-hard problem, Mahajan et al., 2009). Therefore, time and/or compute restrictions make k-means infeasible for large datasets. Agglomerative clustering methods do not require specifying the number of clusters, but generally scale poorly with the number of documents, with runtimes of O(n^2 log n) (Gilpin et al., 2013).
Other clustering algorithms better suited to this task are gmeans (Hamerly and Elkan, 2003) and dpmeans (Kulis and Jordan, 2012); instead of specifying the number of clusters one needs only to specify criteria for adding or splitting clusters. gmeans starts with a single cluster and keeps splitting until the child clusters are less Gaussian than their parent. dpmeans specifies a distance length λ and creates new clusters when documents are greater than λ from any existing cluster (See Algorithm 1 from Kulis and Jordan, 2012).
However, most clustering algorithms, including those mentioned above, struggle with topic noise since every input document must be assigned to a cluster. As an example, for the Reddit titles shown in Table 1, only one cluster should be produced: the /r/Hair subreddit captures a single, coherent topic, while /r/TheSimpsons subreddit titles are irrelevant to any single topic and should be filtered out as noise.
One option for handling topic noise is to apply a pre-processing step to filter it out before clustering (e.g., Godfrey et al., 2014), but without proper care one can accidentally remove informative "hard-to-classify" documents and/or fail to remove all of the topic noise (Guyon et al., 1996). In addition, each dataset will have its own noise profile, warranting a new, detailed analysis per dataset. Some works (e.g., Curiskis et al., 2020) have restricted their analyses to clustering on tight, coherent topics with zero topic noise; however, such studies are unlikely to generalize to datasets where topic noise is present.
Instead, a more desirable approach is to add filtering capabilities directly into the clustering algorithm so that clustering and topic noise filtering can be handled together. A key algorithm designed for this purpose is hdbscan (McInnes et al., 2017), which we use as a benchmark for our new algorithm, FANATIC, in Section 5.
A significant challenge in developing clustering algorithms for social media data is acquiring reliable ground truth labels. In particular, the obtained labels must reliably distinguish documents from coherent topics and those that are topic noise. However, a common practice when using Twitter data, for example, is to use the hashtag(s) as the ground truth label (e.g., Benevenuto et al., 2010; Rosa et al., 2011; Curiskis et al., 2020). Since many Twitter hashtags are generic (e.g., #TuesdayThoughts), tweets containing such hashtags can have very little in common with one another (Bruns and Burgess, 2011; Ferragina et al., 2015). For the Reddit domain, Table 1 illustrates how titles from the /r/Hair subreddit encapsulate a coherent topic while titles from the /r/TheSimpsons subreddit are unrelated. Many studies (e.g., Rosa et al., 2011; Conover et al., 2011; Park and Conway, 2018; Curiskis et al., 2020) do not assess the topical coherency of the hashtag/subreddit used as the ground truth label, raising questions about how coherent the associated content is. In addition, collisions between nearby labels (e.g., #photooftheday and #picoftheday) will also degrade performance since, from a metrics perspective, these identical topics would be considered separate.
Our contributions in this work are as follows:
• FANATIC, a clustering algorithm that is fast, does not require specifying the number of clusters a priori, and is robust to topic noise;
• a new Reddit-based dataset that reliably distinguishes documents from coherent topics and those that are topic noise;
• an evaluation of FANATIC against existing clustering algorithms suited to social media data: hdbscan, gmeans, dpmeans, and LDA.
FANATIC Algorithm

Brief Overview of dpmeans
FANATIC is built upon the original dpmeans algorithm (Kulis and Jordan, 2012), which works by specifying a cluster diameter, λ. The algorithm is initialized by creating a single cluster whose center is the mean of all of the documents. It then iterates over the documents and either a) assigns each document to the nearest cluster, provided the distance is less than λ, or b) creates a new cluster with the document's location as the cluster center. This process repeats until convergence. FANATIC enhances dpmeans to ensure robustness to topic noise through several modifications to the original algorithm; the modifications and their associated parameters are described in the subsections below. The distance function, D, is either cosine or Euclidean, and convergence is achieved when the document-weighted average change in cluster centers falls below a specified threshold. The complete algorithm is outlined in Algorithm 1.
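To make the base procedure concrete, the following minimal Python sketch (ours, not the paper's implementation) shows one dpmeans-style assignment pass over pre-embedded documents; all names and the sample data are illustrative.

```python
import numpy as np

def dpmeans_assignment_pass(docs, centers, lam):
    """One assignment pass of a dpmeans-style loop: each document joins the
    nearest cluster if it lies within lam of its center; otherwise it seeds
    a new cluster at its own location."""
    assignments = []
    for x in docs:
        # Euclidean distance from this document to every current center.
        dists = np.linalg.norm(centers - x, axis=1)
        nearest = int(np.argmin(dists))
        if dists[nearest] <= lam:
            assignments.append(nearest)
        else:
            centers = np.vstack([centers, x])   # new cluster centered on x
            assignments.append(len(centers) - 1)
    return assignments, centers

# Initialization mirrors Algorithm 1: one cluster at the global mean.
docs = np.random.randn(100, 300).astype(np.float32)
centers = docs.mean(axis=0, keepdims=True)
assignments, centers = dpmeans_assignment_pass(docs, centers, lam=20.0)
```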

Minimum Token Probability
[Algorithm 1: FANATIC. INPUT: d_1, ..., d_n ∈ D: set of n documents. PARAMETERS: λ: cluster size; L: token probability threshold; D: distance function (cosine or Euclidean); N_C: maximum number of clusters; S_C: minimum cluster size; M_R: number of cluster-merge rounds; M_d: merge distance between two clusters. OUTPUT: y_1, ..., y_n: cluster assignments for each document; C: total number of clusters. Step 1: Initialize C = 1, µ = {µ_1} s.t. µ_1 = global mean. Step 2: while True do ...]

For text-based clustering it is typical to cluster on word embeddings, yet embeddings of rare words are ineffective and often clump together (e.g., Gong, 2018). Thus, without proper care, disparate content can cluster together simply because it contains rare words. Social media data is particularly rife with rare words due to misspellings, abbreviations, acronyms, special characters, etc. (Chinnov et al., 2015; Curiskis et al., 2020).
Therefore, in addition to distance requirement λ for adding a document to a cluster, we add an additional token-based requirement that a document's tokens must be "sufficiently close" (defined in Equation 3) to the cluster's tokens. This feature encodes the intuition that, not only do we want to group documents that are close in embedding space, but additionally we want their raw tokens to be similar as well. This can significantly improve the purity of clusters as their formation no longer relies solely on the quality of the embedding space.
First we define P_{c,t}, the token probability of token t in cluster c, as the fraction of the cluster's documents containing t:

P_{c,t} = |{d ∈ D_c : t ∈ T_d}| / |D_c|,   (1)

where D_c is the set of documents in cluster c and T_d is the set of tokens in document d.
Defining T_{c,d} as the set of common tokens between the documents in cluster c and a new document d, we then calculate the token probability of document d with respect to cluster c by summing the individual token probabilities of cluster c for each token in T_{c,d}, normalized by the total number of tokens in document d:

P_{c,d} = (1 / |T_d|) Σ_{t ∈ T_{c,d}} P_{c,t}.   (2)

Finally, document d is only added to cluster c if

P_{c,d} ≥ L,   (3)

where L ∈ [0, 1], the token probability threshold, is a tunable parameter. The token probabilities of a cluster, P_{c,t}, are re-calculated every time a new cluster is created or a cluster center is updated. Equation 3 is used during step 6 of Algorithm 1.
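The following sketch (ours) illustrates the token filter; note that the definition of P_{c,t} as the fraction of cluster documents containing token t reflects our reading of Equation 1 and should be treated as an assumption.

```python
from collections import Counter

def token_probabilities(cluster_docs):
    """P_{c,t} for every token t in cluster c: the fraction of the cluster's
    documents containing t (our reading of Equation 1; each token is counted
    at most once per document)."""
    counts = Counter()
    for tokens in cluster_docs:
        counts.update(set(tokens))
    n_docs = len(cluster_docs)
    return {t: c / n_docs for t, c in counts.items()}

def passes_token_filter(doc_tokens, p_ct, L):
    """Equations 2 and 3: sum the cluster's token probabilities over tokens
    shared with the document, normalize by the document's token count, and
    require the result to be at least L."""
    if not doc_tokens:
        return False
    p_cd = sum(p_ct.get(t, 0.0) for t in doc_tokens) / len(doc_tokens)
    return p_cd >= L

cluster = [["hair", "dye", "color"], ["curly", "hair", "routine"]]
p_ct = token_probabilities(cluster)
print(passes_token_filter(["hair", "color", "tips"], p_ct, L=0.2))  # True
```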

Cluster Merging
While iterating over the data, cluster centers can gradually move toward higher density space and find themselves within λ of other clusters. This can result in similar and/or duplicate clusters with arbitrary decision boundaries. Performance can be improved by merging such overlapping clusters.
Cluster merging proceeds in rounds, where the number of rounds, M_R, is a tunable parameter. At a high level, each round commences by finding all pairwise distances between clusters. Next, cluster pairs are greedily chosen in order of ascending pairwise distance. If the distance between the two clusters is less than λM_d, where M_d ∈ [0, 1] is a tunable parameter, the clusters are merged. A cluster may only be merged once per round, as allowing multiple merges can cascade into a single (or several) large, ambiguous clusters. When a merge occurs, the new cluster center becomes the document-weighted average of the two child clusters, while the cluster diameter remains fixed at λ. Cluster merging occurs during step 15 of Algorithm 1.
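A minimal sketch of one merge round, assuming Euclidean centers stored as a NumPy array and per-cluster document counts; the function and variable names are illustrative, not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def merge_round(centers, sizes, lam, m_d):
    """One greedy merge round: visit cluster pairs in order of ascending
    center distance, merge a pair when it is closer than lam * m_d, and let
    each cluster participate in at most one merge per round."""
    k = len(centers)
    dist = squareform(pdist(centers))
    pairs = sorted((dist[i, j], i, j) for i in range(k) for j in range(i + 1, k))
    merged = set()
    new_centers, new_sizes = [], []
    for d, i, j in pairs:
        if d >= lam * m_d or i in merged or j in merged:
            continue
        merged.update((i, j))
        total = sizes[i] + sizes[j]
        # Document-weighted average of the two child centers; the cluster
        # diameter itself stays fixed at lam.
        new_centers.append((centers[i] * sizes[i] + centers[j] * sizes[j]) / total)
        new_sizes.append(total)
    for idx in range(k):
        if idx not in merged:        # untouched clusters carry over unchanged
            new_centers.append(centers[idx])
            new_sizes.append(sizes[idx])
    return np.array(new_centers), new_sizes
```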

Post-Cluster Filtering of Small Clusters and Document Reassignment
After clustering is complete we filter out clusters that have fewer than S_C documents, a tunable parameter, under the intuition that they likely encapsulate highly specific and/or idiosyncratic topics.
To ensure that we do not lose valuable documents during this filtering process, we perform a final assignment step where documents from filtered clusters can be re-assigned to a remaining cluster if the criteria in step 6 of Algorithm 1 are met. This serves as an additional method of cleaning up topic noise: small clusters are removed, while their relevant, topic-coherent documents are taken out and added back into the large clusters to which they belong. Cluster filtering and reassignment are done during step 17 of Algorithm 1.
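A sketch of this filter-and-reassign step; `can_join` is a hypothetical stand-in for the step-6 criteria of Algorithm 1 (distance within λ and token probability at least L), and the None-as-noise convention is ours.

```python
def filter_and_reassign(assignments, clusters, s_c, can_join):
    """Drop clusters with fewer than s_c documents, then try to move each of
    their documents into a surviving cluster. `can_join(doc_id, cluster_id)`
    stands in for the step-6 criteria; documents that fit nowhere become
    noise (None)."""
    small = {c for c, members in clusters.items() if len(members) < s_c}
    surviving = [c for c in clusters if c not in small]
    for c in small:
        for doc_id in clusters[c]:
            new_home = next((s for s in surviving if can_join(doc_id, s)), None)
            assignments[doc_id] = new_home   # None marks filtered topic noise
    return assignments
```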

Limiting Number of Clusters During Clustering
To accommodate the fact that, a priori, the true number of clusters in the dataset is unknown, we introduce the tunable parameter N_C, an upper bound on the total number of clusters. It allows for more flexibility than algorithms where the number of clusters is fixed (e.g., k-means). Specifically, N_C:
• allows documents to be classified as topic noise/outliers: if a document does not belong to an existing cluster and N_C has been reached, the document remains unassigned. Without N_C, a new cluster would always be created;
• acts as a form of regularization, forcing fewer clusters to find an optimal configuration;
• speeds up document assignment, since once N_C is reached the remaining documents can be assigned in parallel.
The sketch below illustrates this capped assignment branch.
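This is a minimal Python sketch, with illustrative sentinel return values:

```python
def assign_with_cap(nearest, dist_to_nearest, lam, n_clusters, n_c):
    """Assignment branch with the N_C cap: once n_c clusters exist, a document
    that fits no existing cluster stays unassigned (topic noise) rather than
    seeding a new cluster."""
    if dist_to_nearest <= lam:
        return nearest            # joins the nearest existing cluster
    if n_clusters < n_c:
        return "new"              # still room: seed a new cluster here
    return None                   # cap reached: leave as unassigned noise
```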

Data
We evaluate our algorithm's performance on the Pushshift Reddit dataset (Baumgartner et al., 2020), as it is publicly available and suitable for clustering. Specifically, the Reddit platform is organized into categories, or subreddits, which generally focus on a single topic and contain a large number of user posts, each with a title. We use the titles of posts from selected subreddits as input documents for clustering, while the cluster labels are derived from the subreddit via an annotation task described below.

Annotation Task to Extract Coherent Subreddits
As mentioned in Section 1, obtaining ground truth labels requires care due to the fact that many subreddits are, topically speaking, very general (e.g., /r/Showerthoughts), especially when considering only the title of a post without additional context (see Table 1). Here we define an annotation task with the goal of identifying the subreddits which encapsulate a single "coherent" topic and those which do not, which we label as "noise".

Topic Definition
We acknowledge upfront that many valid definitions of "topic" exist, and future users are encouraged to try others, as FANATIC is not tied to a particular one. In this work we follow Guille et al. (2013), who define a topic as "a coherent set of semantically related terms that express a single argument". We apply this definition to our annotation task (and downstream clustering) such that a topic must be characterized by a central noun (e.g., "sports", "cooking", "fitness") and cannot be defined by a central adjective (e.g., "happy", "cute", "interesting"). It is possible that some of the subreddits we assign as "noise" are in fact coherent topics when viewed holistically on www.reddit.com (e.g., /r/TheSimpsons). Importantly, however, in this work we only consider the title of each post and disregard all other content (pictures, text body, comments, etc.). Since our dataset has therefore been significantly altered from the original content, some annotation labels may deviate from human expectation.

Task Design
For this annotation task, we randomly sample 1000 subreddits. From each subreddit, we randomly sample forty posts and have six annotators each evaluate a random subset of twenty of those forty posts. When presenting posts to annotators, we omit the subreddit label to avoid biasing the annotator (e.g., /r/Showerthoughts gives context to otherwise unrelated posts).
We ask annotators to evaluate topic coherency by a) judging whether the majority of the titles (sampled from a single subreddit) represent a coherent topic and, if so, b) providing a short summary of the topic.
We crowdsource our annotations using a leading commercial crowdsourcing platform, where anonymized annotators are sampled randomly from around the world. We utilize quality control features which exclude low-performing contributors based on golden test questions, as well as other quality control measures described in more detail in Appendix A.1.2.

Extracting Annotation Label
To increase the reliability of our results, we limit the final set of subreddits to only those annotated consistently across all six annotators. We define subreddits in which all annotators answered question a) with "yes" as "coherent" subreddits, and those in which all annotators answered "no" as "noise" subreddits. As an additional quality control step, we examine the annotator-provided summaries and manually filter out any subreddits whose summaries did not unanimously describe a single semantic topic.
Although we now have high confidence as to which subreddits encapsulate coherent topics and which are topic noise, we still have not accounted for the fact that subreddits can overlap in content, and a particular Reddit post could (and often does) belong to many subreddits. It is important to account for this overlap when assigning cluster labels to avoid unfair penalization in downstream metrics.
Therefore, as the final filtering step, through a combination of TF-IDF analysis and manual vetting, we remove subreddits which are semantically similar (e.g., /r/Hair and /r/curlyhair), and always remove the smaller of the two subreddits.
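A sketch of how such a TF-IDF similarity screen might look using scikit-learn; the similarity threshold and sample texts are illustrative, and flagged pairs would still go to manual vetting as described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def flag_similar_subreddits(subreddit_texts, threshold=0.5):
    """Flag subreddit pairs whose concatenated titles look similar under
    TF-IDF cosine similarity. Flagged pairs are candidates for removal of
    the smaller subreddit; the 0.5 threshold is illustrative."""
    names = list(subreddit_texts)
    tfidf = TfidfVectorizer().fit_transform([subreddit_texts[n] for n in names])
    sim = cosine_similarity(tfidf)
    return [(names[i], names[j], sim[i, j])
            for i in range(len(names)) for j in range(i + 1, len(names))
            if sim[i, j] >= threshold]

pairs = flag_similar_subreddits({
    "Hair": "dyed my hair purple today new haircut advice",
    "curlyhair": "curly hair routine advice diffuser frizz",
})
```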

Final Dataset
After the aforementioned annotation procedure, our dataset is finalized to 25 coherent subreddits and 67 noise ones, which are listed in Appendix A.1.1. For the remainder of this work we restrict to titles from these subreddits.

Preprocessing and Embeddings
All Reddit titles are preprocessed by a) normalizing urls, numbers, @mentions, emoticons, dollar amounts, emails, and phone numbers, b) lowercasing, c) tokenizing, and d) filtering out stopwords using NLTK's standard stopword list.
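A minimal sketch of this preprocessing pipeline using NLTK; the regexes shown cover only a subset of the normalizations listed above and are illustrative.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK resources.
nltk.download("stopwords")
nltk.download("punkt")
STOPWORDS = set(stopwords.words("english"))

def preprocess(title):
    """Normalize, lowercase, tokenize, and remove stopwords."""
    title = re.sub(r"https?://\S+", "<url>", title)   # normalize urls
    title = re.sub(r"@\w+", "<mention>", title)       # normalize @mentions
    title = re.sub(r"\d+", "<number>", title)         # normalize numbers
    tokens = word_tokenize(title.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("Check out https://example.com, @user posted 42 tips!"))
```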
Using a trained Word2Vec model (Mikolov et al., 2013), each title's tokens are embedded and averaged into a single vector. Although more sophisticated techniques exist for combining token-level embeddings into document-level embeddings (e.g., Arora et al., 2017; De Boom et al., 2016), these methods generally depend on term-frequency statistics, which can be unreliable in noisy social media data (spelling mistakes, slang, etc.). Furthermore, a simple average often performs competitively on short texts (Wieting et al., 2016). The Word2Vec model was trained via gensim (Řehůřek and Sojka, 2010) on the RS_2017-08.bz2 through RS_2017-11.bz2 data files using a standard embedding size of 300 and a window size of 5. We find downstream results insensitive to changes in Word2Vec hyperparameters, likely due to the short nature of each Reddit title.
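A sketch of this embedding step using gensim 4.x argument names (older gensim versions use `size` instead of `vector_size`); the toy corpus and `min_count` value are illustrative.

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus standing in for the preprocessed Reddit titles; the paper trains
# on the Pushshift RS_2017-08 through RS_2017-11 files.
tokenized_titles = [["dyed", "hair", "purple"], ["new", "haircut", "today"]]

# vector_size=300 and window=5 follow the paper; min_count is our choice.
model = Word2Vec(sentences=tokenized_titles, vector_size=300, window=5, min_count=1)

def embed(tokens, model):
    """Average the Word2Vec vectors of in-vocabulary tokens into a single
    document vector; fall back to zeros when nothing is in vocabulary."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

doc_vec = embed(["purple", "hair"], model)
```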

Alternative Featurizations
Since FANATIC relies only on embeddings and tokens for clustering, future users are encouraged to featurize however they wish, provided a static embedding vector and token set can be generated per document. For example, one could use contextual embeddings (e.g., Reimers and Gurevych, 2019) instead of Word2Vec, and the code has been specifically modularized to accommodate alternative preprocessing. Our choice of Word2Vec stemmed from a need for a strong baseline embedding model to showcase FANATIC's potential; FANATIC should still perform well regardless of featurization strategy.

Clustering Algorithms
The documents are then clustered using each of the following clustering algorithms until convergence:
• FANATIC (see Section 2)
• dpmeans (Kulis and Jordan, 2012)
• gmeans (Hamerly and Elkan, 2003)
• hdbscan (McInnes et al., 2017)
• LDA (Blei et al., 2001)
For gmeans, dpmeans, and LDA we add an additional hyperparameter that filters out clusters smaller than S_C after the algorithm completes (FANATIC and hdbscan already have this feature). Without this added feature, gmeans, dpmeans, and LDA would have no opportunity to filter out noise. We emphasize that when S_C = 0 this added feature is disabled and the algorithms revert to their original implementations; if that scenario is preferred, it will be selected during hyperparameter tuning.
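A minimal sketch of this added post-hoc filter; the use of -1 as the noise marker is our convention, not the original implementations'.

```python
import numpy as np

def posthoc_small_cluster_filter(labels, s_c):
    """After clustering completes, relabel members of clusters smaller than
    s_c as noise (-1). With s_c == 0 the filter is a no-op and the original
    algorithm's output is returned unchanged."""
    labels = np.asarray(labels).copy()
    if s_c == 0:
        return labels
    ids, counts = np.unique(labels, return_counts=True)
    for cid, n in zip(ids, counts):
        if n < s_c:
            labels[labels == cid] = -1
    return labels

print(posthoc_small_cluster_filter([0, 0, 0, 1, 2, 2], s_c=2))  # [0 0 0 -1 2 2]
```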

Labeling Noise Documents
Unlike the supervised classification domain, where standard metrics like precision, recall, and f1 are reliable measures of performance, there are no equivalent one-size-fits-all metrics for the clustering domain (Romano et al., 2016). This is especially true when considering topic noise, where Amigó et al. (2009) show that almost all clustering metrics fail the "rag bag" scenario, which occurs when the data contains a collection of disparate items that should not be grouped with the other items (think "miscellaneous", "other", or, in our case, "topic noise").
To best handle topic noise in this work, we assign the same NOISE label to all Reddit titles from "noise" subreddits. From a metrics perspective, this consolidates the rag bag of noise documents into a single cluster label, encouraging them to be grouped together. This is an ideal labeling scheme for filtering topic noise; however, as we will see in Section 5.1, it can also encourage disparate NOISE content to group together into clusters since it shares the same label, which is not ideal.
An alternative noise labeling scheme could be to assign a unique label to each noise document; however, this would dramatically increase the number of labels and result in extreme label imbalances, which are very challenging for cluster metrics to handle (e.g., de Souto et al., 2012).

Performance Metrics
To select "best" runs and measure how well similarly-labeled documents are grouped together, we use the well-established Adjusted Mutual Information, or AMI (Vinh et al., 2010). We select AMI because it is a) adjusted for chance, b) robust to changes in the number of clusters and/or documents (Vinh et al., 2010; Meilă, 2007), and c) fast to compute. Other metrics such as V-measure (Rosenberg and Hirschberg, 2007), Fowlkes-Mallows (Fowlkes and Mallows, 1983), and B-Cubed (Bagga and Baldwin, 1998) do not have such properties (Meilă, 2007; Gösgens et al., 2019; scikit-learn developers, 2020).
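For reference, AMI is available directly in scikit-learn; the ground-truth and predicted label arrays below are toy examples of the labeling scheme from Section 4.3.1.

```python
from sklearn.metrics import adjusted_mutual_info_score

# Ground truth: coherent-subreddit names, with every noise title sharing the
# single NOISE label. Predicted: cluster ids, with one shared id (-1) for
# documents left out of all clusters.
y_true = ["Hair", "Hair", "fitness", "NOISE", "NOISE"]
y_pred = [0, 0, 1, -1, -1]
print(adjusted_mutual_info_score(y_true, y_pred))  # 1.0 for this toy example
```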
For each run we also measure:
• pseudo-precision, P*, which tracks the contamination of topic noise in clusters;
• pseudo-recall, R*, which tracks how well documents from coherent topics are retained in clusters vs. filtered out as noise.
These are calculated as:

P* = |tp*| / (|tp*| + |fp*|),   R* = |tp*| / (|tp*| + |fn*|),

where tp* is the set of documents from coherent topics that ended up in a cluster, fp* is the set of noise documents that ended up in a cluster, and fn* is the set of documents from coherent topics that did not end up in any cluster. These are pseudo values since they only track whether a document ended up in any cluster vs. the correct cluster. However, since topic noise should not end up in any cluster, these metrics allow us to track the contamination of topic noise in clusters and determine how robust each clustering algorithm is at filtering it. A lower P* implies that more noise documents are contaminating clusters, while a lower R* implies that more documents from coherent topics are being excluded from clusters.
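Since P* and R* are non-standard, a direct implementation of the definitions above may serve as a reference; this sketch is ours and the label conventions are illustrative.

```python
def pseudo_precision_recall(y_true, clustered):
    """P* and R* per the definitions above: `clustered[i]` is True when
    document i landed in some cluster, and y_true[i] == "NOISE" marks
    topic-noise documents."""
    tp = sum(c and t != "NOISE" for t, c in zip(y_true, clustered))
    fp = sum(c and t == "NOISE" for t, c in zip(y_true, clustered))
    fn = sum((not c) and t != "NOISE" for t, c in zip(y_true, clustered))
    p_star = tp / (tp + fp) if tp + fp else 0.0
    r_star = tp / (tp + fn) if tp + fn else 0.0
    return p_star, r_star

print(pseudo_precision_recall(["Hair", "Hair", "NOISE"], [True, False, True]))
```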

Amount of Noise vs. Performance
In our first experiment we fix the number of documents, N_D, at 50k and vary the fraction of documents that are topic noise, f_n. The two questions we want to answer are how well each algorithm:
• groups similarly-labeled documents together;
• filters topic noise.
The first question is answered via the AMI score, while the second is answered via the P* and R* scores, which (as mentioned in Section 4.3.2) monitor how an algorithm filters noise while retaining valid content in clusters.
At each (algorithm, f_n) combination we run 250 experiments, randomly sampling over the algorithm's hyperparameters, and select the run with the highest AMI score as the best. Best results for each (algorithm, f_n) combination are displayed in Figure 1. See Appendix A.1.3 for additional experimental details.

FANATIC vs. dpmeans
The top panel of Figure 1 shows that FANATIC and dpmeans both have the highest AMI scores, indicating equal ability to group similarly-labeled documents together. We emphasize that the AMI score includes the grouping of noise documents, as they all share the same NOISE label. As f_n increases, this grouping of topic noise becomes an increasingly dominant component of the overall AMI score (e.g., at f_n = 0.5, 50% of the clusterable content is topic noise).
The final three rows of Figure 1 show how, although dpmeans and FANATIC have equal AMI scores, FANATIC is superior at filtering topic noise from clusters, for two key reasons:
• For all experiments the pseudo-precision, P* (third row), is noticeably higher for FANATIC while still maintaining high pseudo-recall, R* (last row). This indicates that FANATIC does a better job of filtering noise while keeping valid documents in clusters. In contrast, dpmeans has the lowest P* of any algorithm, indicating poor ability to filter out noise.
• The fraction of documents clustered, f_D (second row), for dpmeans is approximately 1 regardless of the amount of noise present, f_n. This means that, although dpmeans can effectively group noise documents together (it has a high AMI score), this noise is contaminating clusters instead of being filtered. In contrast, for FANATIC f_D is proportional to f_n, illustrating how it filters more documents when more noise is present.
These findings are qualitatively highlighted in Tables 2 and 3, which show eight randomly sampled documents from the cluster with the most "Hair" subreddit mentions for the best performing f_n = 0.4 runs of FANATIC and dpmeans, respectively. As can be seen, although the dpmeans cluster contains roughly equal numbers of NOISE and "Hair" labels (yielding a good AMI score), the cluster itself carries little topical coherency. In contrast, the FANATIC cluster clearly shows a valid "Hair" topic, and even the contamination (e.g., the "giveaways" label) contains relevant content.

FANATIC vs. hdbscan, gmeans, LDA
FANATIC achieves better AMI scores at all f_n than hdbscan, gmeans, and LDA, with the greatest performance difference occurring at f_n = 0. This indicates its superior ability to group similar documents together, especially in the absence of any topic noise. The other algorithms generally struggle to achieve the trifecta of high AMI, P*, and R*, and also tend to cluster the same fraction of documents, f_D, independent of the amount of noise present, f_n (second row in Figure 1), indicating little sensitivity to filtering out topic noise.
Interestingly, hdbscan consistently achieves the highest P* with very low R*, indicating that it is a harsh filter: it can reliably filter noise documents but tends to discard relevant documents.

Number of Documents vs. Performance
We take the best performing runs from Section 5.1 at f_n = 0.2 and exponentially increase the number of documents, N_D, to answer how each algorithm:
• is affected by data perturbation;
• scales computationally with N_D.
Best results for each (N_D, algorithm) combination are displayed in Figure 2, which shows that FANATIC again performs best and is robust to changes in N_D. In particular:
• AMI, f_D, P*, and R* do not change as a function of N_D, indicating high stability.
• It has the highest AMI score (tied with dpmeans).
• The fraction of documents clustered, f_D, is proportional to the fraction of noise (in particular, f_D = 1 − f_n = 1 − 0.2 = 0.8).
• P* and R* are both high, indicating it can filter noise while keeping valid documents in clusters.
All other algorithms show some additional drawback, including a lower AMI score (gmeans, LDA, hdbscan), a disproportionate clustered fraction, f_D (all other algorithms), lower pseudo-precision, P* (LDA, dpmeans, mostly hdbscan), lower pseudo-recall, R* (gmeans, partially hdbscan), or instability of results as N_D changes (hdbscan).

Computational Efficiency
FANATIC, LDA, and dpmeans all scale very efficiently, as shown in the bottom row of Figure 2, which plots clustering time (in seconds) vs. N_D. In particular, FANATIC is two orders of magnitude faster than hdbscan: at N_D = 500k, hdbscan takes over 7 hours while FANATIC takes 2 minutes. The slowness of hdbscan is likely due to the 300-dimensional embeddings (typical for the NLP domain, e.g., Pennington et al., 2014), and others in the community have also noticed that hdbscan's scaling degrades as the embedding dimension increases (Leland McInnes, 2018).
Generalization to Twitter Data

To evaluate how FANATIC generalizes to other datasets, we briefly test on a collection of 20k tweets collected from 2019-12-18 to 2019-12-21 via the Twitter API. These tweets span 20 hashtags (see Appendix A.2.1 for the list), which were manually vetted to be topically coherent and disparate from each other. Since each hashtag represents a coherent topic, there is no "topic noise" in this experiment, and by default P* = 1. Tweets are preprocessed into documents, clustered, and evaluated in an identical manner to the Reddit data (see Section 4).
The results in Table 4 show that FANATIC again has the highest AMI, highlighting its superior ability to cluster similar documents together. It also has the highest pseudo-recall, R*, indicating that it classified almost no documents as noise, as it ideally should since no topic noise is present. In Appendix A.2.2 we show five randomly sampled documents from the cluster with the most "#crypto" mentions for each algorithm's best run. These samples qualitatively match the findings from Section 5.1: namely, that FANATIC is best at grouping similarly-labeled documents together, followed by dpmeans. hdbscan again tends to act as a "harsh filter", yielding precise clusters at the cost of filtering out significant amounts of valuable content.

Conclusion
In this paper we present FANATIC, an algorithm capable of robustly extracting coherent topics, even in the presence of topic noise. We first showed that AMI scores for FANATIC were consistently high across three experiments, indicating a general ability to group similar documents together. Second, we showed that pseudo-precision, P*, and pseudo-recall, R*, were consistently high, demonstrating its robustness in detecting and filtering topic noise. Third, we demonstrated that FANATIC's performance remained consistent as the noise fraction, f_n, and the total number of documents increased, displaying its robustness to different scenarios. As an added advantage, FANATIC also performed best when zero topic noise was present. Finally, we found FANATIC to be two orders of magnitude faster than hdbscan, demonstrating its scalability and efficiency in the NLP domain.
We particularly recommend FANATIC over other clustering algorithms if the number of documents is large and/or topic noise is present.

A.1 Supplemental Material for Reddit
A.1.1 Dataset: List of Subreddits Used

Table 5 displays the list of "coherent" and "noise" subreddits in our Reddit dataset used throughout our experiments in Sections 5.1 and 5.2. "Coherent" subreddits encapsulate a single topic while "noise" subreddits lack identifiable topics. We remind the reader that these topic labels have been derived solely from Reddit post titles, with all additional content (pictures, text body, comments, etc.) discarded. See Section 3.1.1 for additional discussion. Although ∼90% of subreddits have been filtered (from the original set of 1000), potentially introducing bias, this is a worthwhile tradeoff, as our final dataset enables reliable assessment of cluster quality in the presence of topic noise, a valuable measurement previously absent from the community.

A.1.2 Label Quality
Our downstream results assume that the labels obtained from our annotation task generalize to the entire subreddit. While in general this is an effective way to obtain labels for thousands of documents (which would be infeasible to annotate individually), it is possible that some subreddits have been mislabelled.
We have tried to minimize this possibility by using the strictest possible filters as described in Section 3.1.3, namely unanimous annotator agreement and semantic agreement on the provided summary. We also restricted the data to the year 2017 to minimize potential distribution shift.
It is also possible that some fraction of the documents within a coherent subreddit are actually noise. During our annotation task we allowed annotators to flag documents from coherent subreddits that did not belong. We found that, on average, 0.8 ± 2.1 of the 20 titles shown were flagged, suggesting that the fraction of misannotated coherent documents is low. For our noise subreddits, since all 67 of them were uniquely annotated as noise and are combined into a single NOISE label (see Section 4.3.1), no individual subreddit contributes greatly to the whole, mitigating the risk of any one subreddit having been misannotated. In general, as long as the misannotated fraction is low, its effect on downstream metrics will also be low.
Overall, we have taken considerable care in ensuring that the labels reflect the topical content.

A.1.3 Experimental Details
For the experiments in Section 5.1 we use the RS_2017-11.bz2 data file from Pushshift, which contains Reddit data from November 2017, while for the experiments in Section 5.2 we use the RS_2017-01.bz2 through RS_2017-11.bz2 data files (January through November 2017 data), as needed, depending on the amount of data required for the experiments.
All experiments and derived clustering times were run with 4 CPU cores and 50 GB of memory. Experiments that took over two days to run or required additional memory were not considered for the paper (in practice this only restricted the longest gmeans runs). Table 6 shows the min/max range and scale for each FANATIC hyperparameter, where "lin.", "log", and "int" denote linear, logarithmic, and positive-integer scales, and λ_cos and λ_euc correspond to λ when cosine and Euclidean distances, respectively, were selected. These ranges were set by a) obtaining initial results from very broad hyperparameter ranges and b) reducing the hyperparameter space to exclude ranges with consistently poor results, for improved efficiency.
Table 6: Hyperparameter search ranges for FANATIC.

Parameter   min    max    scale
λ_cos       0.1    1      lin.
λ_euc       1      3.5    lin.
L           0      0.3    lin.
N_C         10^1   10^3   log
S_C         10^1   10^3   log
M_R         0      1      int
M_d         0      1      lin.

The best FANATIC hyperparameters are listed in Table 7 for each (N_D, f_n) combination. Numbers are rounded to two significant digits, and when M_R = 0, then M_d = 0 by default. For the distance function, D, λ_cos and λ_euc correspond to λ when cosine and Euclidean distances, respectively, were selected.

A.2 Supplemental Material for Twitter

A.2.1 Dataset: List of Hashtags Used

Table 8 displays the list of 20 Twitter hashtags used for our experiment in Section 5.3. Each hashtag was manually vetted to be both topically coherent and disparate from other hashtags.