Topic Modeling: Contextual Token Embeddings Are All You Need

Dimo Angelov, Diana Inkpen


Abstract
The goal of topic modeling is to find meaningful topics that capture the information present in a collection of documents. The main challenges of topic modeling are finding the optimal number of topics, labeling the topics, segmenting documents by topic, and evaluating topic model performance. Current neural approaches have tackled some of these problems, but none have been able to solve all of them. We introduce a novel topic modeling approach, Contextual-Top2Vec, which uses document contextual token embeddings; it creates hierarchical topics, finds topic spans within documents, and labels topics with phrases rather than just words. We propose the use of BERTScore to evaluate topic coherence and to evaluate how informative topics are of the underlying documents. Our model outperforms the current state-of-the-art models on a comprehensive set of topic model evaluation metrics.
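To make the proposed evaluation concrete, the following is a minimal sketch of scoring topic informativeness with the bert-score package: each topic's label phrase is compared against the documents assigned to that topic, and the BERTScore F1 values are averaged. The pairing scheme and the `topics` data below are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch: using BERTScore to gauge how informative topic labels
# are of their underlying documents. The label-vs-document pairing here
# is an assumption for illustration, not the paper's exact procedure.
from bert_score import score  # pip install bert-score

# Hypothetical example data: topic label phrases mapped to the
# documents assigned to each topic.
topics = {
    "renewable energy": [
        "Solar panel installations grew rapidly last year.",
        "Wind farms now supply a large share of the grid.",
    ],
    "space exploration": [
        "The rover collected new rock samples on Mars.",
    ],
}

candidates, references = [], []
for label, docs in topics.items():
    for doc in docs:
        candidates.append(label)  # topic label phrase as candidate
        references.append(doc)    # assigned document as reference

# BERTScore precision/recall/F1 between each label and its document;
# averaging F1 gives a rough topic-informativeness score.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"Mean BERTScore F1: {F1.mean().item():.3f}")
```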
Anthology ID:
2024.findings-emnlp.790
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
13528–13539
URL:
https://aclanthology.org/2024.findings-emnlp.790
Cite (ACL):
Dimo Angelov and Diana Inkpen. 2024. Topic Modeling: Contextual Token Embeddings Are All You Need. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 13528–13539, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Topic Modeling: Contextual Token Embeddings Are All You Need (Angelov & Inkpen, Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.790.pdf