Evaluating Unsupervised Hierarchical Topic Models Using a Labeled Dataset

Judicael Poumay; Ashwin Ittoo

Abstract

Topic modeling is a commonly used method for identifying and extracting topics from a corpus of documents. While several evaluation techniques, such as perplexity and topic coherence, have been developed to assess the quality of extracted topics, they fail to determine whether all topics have been identified and to what extent they have been represented. Additionally, hierarchical topic models have been proposed, but the quality of the hierarchy produced has not been adequately evaluated. This study proposes a novel approach to evaluating topic models that supplements existing methods. Using a labeled dataset, we trained hierarchical topic models in an unsupervised manner and used the known labels to evaluate the accuracy of the results. Our findings indicate that labels encompassing a substantial number of documents achieve high accuracy of over 70%. Although there are 90 labels in the dataset, labels that cover only 1% of the data still achieve an average accuracy of 37.9%, demonstrating the effectiveness of hierarchical topic models even on smaller subsets. Furthermore, we demonstrate that these labels can be used to assess the quality of the topic tree and confirm that hierarchical topic models produce coherent taxonomies for the labels.
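The label-based evaluation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definition of accuracy, so the code below assumes one plausible reading: for each label, take the topic that contains the most documents carrying that label, and report the fraction of the label's documents that fall in that topic. The function name `per_label_accuracy`, the topic ids, and the example labels are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def per_label_accuracy(doc_topics, doc_labels):
    """Assumed metric, not the paper's stated definition.

    doc_topics: one topic id per document (output of any topic model).
    doc_labels: one set of gold labels per document (multi-label allowed).
    Returns {label: (best_topic, share of the label's documents in it)}.
    """
    counts = defaultdict(Counter)  # label -> how often each topic occurs
    for topic, labels in zip(doc_topics, doc_labels):
        for label in labels:
            counts[label][topic] += 1

    result = {}
    for label, topic_counts in counts.items():
        # Topic holding the largest number of this label's documents.
        best_topic, hits = topic_counts.most_common(1)[0]
        result[label] = (best_topic, hits / sum(topic_counts.values()))
    return result


# Toy data: six documents, hypothetical topic ids and labels.
doc_topics = ["t1", "t1", "t2", "t2", "t2", "t3"]
doc_labels = [{"grain"}, {"grain", "wheat"}, {"trade"},
              {"trade"}, {"grain"}, {"trade"}]

scores = per_label_accuracy(doc_topics, doc_labels)
for label, (topic, acc) in sorted(scores.items()):
    print(f"{label}: best topic {topic}, accuracy {acc:.2f}")
```

For a hierarchical model, the same function could be applied once per tree level by passing the topic id of each document at that level, which would give a rough view of how well labels are recovered at different depths.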

Anthology ID: 2023.ranlp-1.91
Volume: Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
Month: September
Year: 2023
Address: Varna, Bulgaria
Editors: Ruslan Mitkov, Galia Angelova
Venue: RANLP
Publisher: INCOMA Ltd., Shoumen, Bulgaria
Pages: 846–853
URL: https://aclanthology.org/2023.ranlp-1.91/
Cite (ACL): Judicael Poumay and Ashwin Ittoo. 2023. Evaluating Unsupervised Hierarchical Topic Models Using a Labeled Dataset. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, pages 846–853, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal): Evaluating Unsupervised Hierarchical Topic Models Using a Labeled Dataset (Poumay & Ittoo, RANLP 2023)
PDF: https://aclanthology.org/2023.ranlp-1.91.pdf
