Arik Reuter
2024
Topics in the Haystack: Enhancing Topic Quality through Corpus Expansion
Anton Thielmann | Arik Reuter | Quentin Seifert | Elisabeth Bergherr | Benjamin Säfken
Computational Linguistics, Volume 50, Issue 2 - June 2024
Extracting and identifying latent topics in large text corpora has gained increasing importance in Natural Language Processing (NLP). Most models, whether probabilistic models similar to Latent Dirichlet Allocation (LDA) or neural topic models, follow the same underlying approach of topic interpretability and topic extraction. We propose a method that incorporates a deeper understanding of both sentence and document themes, and goes beyond simply analyzing word frequencies in the data. Through simple corpus expansion, our model can detect latent topics that may include uncommon words or neologisms, as well as words not present in the documents themselves. Additionally, we propose several new evaluation metrics based on intruder words and similarity measures in the semantic space. We present correlation coefficients with human identification of intruder words and achieve near-human level results at the word-intrusion task. We demonstrate the competitive performance of our method with a large benchmark study, and achieve superior results compared with state-of-the-art topic modeling and document clustering models. The code is available at the following link: https://github.com/AnFreTh/STREAM.
STREAM: Simplified Topic Retrieval, Exploration, and Analysis Module
Anton Thielmann | Arik Reuter | Christoph Weisser | Gillian Kant | Manish Kumar | Benjamin Säfken
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Topic modeling is a widely used technique to analyze large document corpora. With the ever-growing emergence of scientific contributions in the field, non-technical users may often use the simplest available software module, independent of whether there are potentially better models available. We present a Simplified Topic Retrieval, Exploration, and Analysis Module (STREAM) for user-friendly topic modeling and especially subsequent interactive topic visualization and analysis. For better topic analysis, we implement multiple intruder-word based topic evaluation metrics. Additionally, we publicize multiple new datasets that can extend the so far very limited number of publicly available benchmark datasets in topic modeling. We integrate downstream interpretable analysis modules to enable users to easily analyze the created topics in downstream tasks together with additional tabular information. The code is available at the following link: https://github.com/AnFreTh/STREAM
Co-authors
- Anton Thielmann 2
- Benjamin Säfken 2
- Quentin Seifert 1
- Elisabeth Bergherr 1
- Christoph Weisser 1