Judicael Poumay


2023

Evaluating Unsupervised Hierarchical Topic Models Using a Labeled Dataset
Judicael Poumay | Ashwin Ittoo
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Topic modeling is a commonly used method for identifying and extracting topics from a corpus of documents. While several evaluation techniques, such as perplexity and topic coherence, have been developed to assess the quality of extracted topics, they fail to determine whether all topics have been identified and to what extent they are represented. Additionally, hierarchical topic models have been proposed, but the quality of the hierarchies they produce has not been adequately evaluated. This study proposes a novel approach to evaluating topic models that supplements existing methods. Using a labeled dataset, we trained hierarchical topic models in an unsupervised manner and used the known labels to evaluate the accuracy of the results. Our findings indicate that labels encompassing a substantial number of documents achieve a high accuracy of over 70%. Moreover, even though the dataset contains 90 labels, labels covering only 1% of the data still achieve an average accuracy of 37.9%, demonstrating that hierarchical topic models remain effective even on small subsets of a corpus. Furthermore, we demonstrate that these labels can be used to assess the quality of the topic tree and confirm that hierarchical topic models produce coherent taxonomies for the labels.
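The label-based evaluation described in this abstract can be illustrated with a minimal sketch. The paper does not spell out its exact topic-to-label assignment rule, so the helper below (label_accuracy is a hypothetical name, not code from the paper) assumes the simplest variant: each topic is mapped to the majority label of its documents, and a label's accuracy is the fraction of its documents that land in a topic mapped back to that label.

    from collections import Counter, defaultdict

    def label_accuracy(doc_topics, doc_labels):
        # Map each topic to the majority label of the documents assigned to it.
        per_topic = defaultdict(Counter)
        for topic, label in zip(doc_topics, doc_labels):
            per_topic[topic][label] += 1
        topic_to_label = {t: counts.most_common(1)[0][0]
                          for t, counts in per_topic.items()}

        # Per-label accuracy: the share of a label's documents whose
        # assigned topic maps back to that label.
        hits, totals = Counter(), Counter()
        for topic, label in zip(doc_topics, doc_labels):
            totals[label] += 1
            hits[label] += topic_to_label[topic] == label
        return {label: hits[label] / totals[label] for label in totals}

    # Toy usage: topics 0 and 1 recover "sports" and "politics" cleanly,
    # while the lone "news" document is absorbed by the sports topic.
    print(label_accuracy([0, 0, 1, 1, 0],
                         ["sports", "sports", "politics", "politics", "news"]))

Under this assumption, labels spread thinly across many topics score low, which matches the intuition behind the per-label accuracy figures reported above.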

HTMOT: Hierarchical Topic Modelling over Time
Judicael Poumay | Ashwin Ittoo
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Topic models provide an efficient way of extracting insights from text and supporting decision-making. Recently, novel methods have been proposed to model topic hierarchy or temporality. Modeling temporality yields more precise topics by separating topics that share similar words but occur in distinct time periods. Conversely, modeling hierarchy provides a more detailed view of the content of a corpus through topics and sub-topics. However, no model has been proposed that incorporates both hierarchy and temporality, which could be beneficial for applications such as environmental scanning. Therefore, we propose a novel method to perform Hierarchical Topic Modelling Over Time (HTMOT). We evaluate the performance of our approach on a corpus of news articles using the Word Intrusion task. Results demonstrate that our model produces topics that elegantly combine a hierarchical structure and a temporal aspect. Furthermore, our proposed Gibbs sampling implementation shows competitive performance compared to previous state-of-the-art methods.
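As a concrete illustration of the Word Intrusion task used for evaluation here: an item is built by taking a topic's top words, injecting one high-probability word from an unrelated topic, and asking evaluators to spot the intruder; coherent topics make the intruder easy to identify. The sketch below (make_intrusion_item is a hypothetical helper, not code from the paper) shows one way to construct such an item.

    import random

    def make_intrusion_item(topic_words, intruder_pool, k=5, seed=0):
        # Take the topic's top-k words and inject one word drawn from
        # another topic's high-probability words (the "intruder").
        rng = random.Random(seed)
        words = list(topic_words[:k])
        intruder = rng.choice([w for w in intruder_pool if w not in words])
        item = words + [intruder]
        rng.shuffle(item)
        return item, intruder

    # Toy usage: a coherent "space" topic with a cooking intruder.
    item, answer = make_intrusion_item(
        ["orbit", "launch", "satellite", "rocket", "mission"],
        ["recipe", "oven", "flour"])
    print(item, "->", answer)

The fraction of items on which evaluators correctly pick the intruder then serves as the topic quality score.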

2021

A Comprehensive Comparison of Word Embeddings in Event & Entity Coreference Resolution.
Judicael Poumay | Ashwin Ittoo
Findings of the Association for Computational Linguistics: EMNLP 2021

Coreference Resolution is an important NLP task, and most state-of-the-art methods rely on word embeddings for word representation. However, one issue that has been largely overlooked in the literature is that of comparing the performance of different embeddings across and within families. Therefore, we frame our study in the context of Event and Entity Coreference Resolution (EvCR & EnCR) and address two questions: 1) Is there a trade-off between performance (predictive and run-time) and embedding size? 2) How does the performance of embeddings compare within and across families? Our experiments reveal several interesting findings. First, we observe diminishing returns in performance with respect to embedding size; for example, a model using solely a character embedding achieves 86% of the performance of the largest model (ELMo, GloVe, Character) while being 1.2% of its size. Second, larger models that use multiple embeddings learn faster despite being slower per epoch, though they remain slower at test time. Finally, ELMo performs best on both EvCR and EnCR, while GloVe and FastText perform best in EvCR and EnCR, respectively.
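For readers who want to reproduce this kind of size comparison, the sketch below loads two pretrained embedding families through gensim's downloader and contrasts vocabulary size, dimensionality, and memory footprint. The model identifiers are gensim-data names, not the exact embeddings used in the paper, and downstream coreference scoring is out of scope for this sketch.

    import gensim.downloader as api

    # Contrast two embedding families on raw size; model names are
    # gensim-data identifiers, not the paper's exact configurations.
    for name in ("glove-wiki-gigaword-100", "fasttext-wiki-news-subwords-300"):
        vectors = api.load(name)  # downloads the vectors on first use
        print(f"{name}: vocab={len(vectors.index_to_key)}, "
              f"dim={vectors.vector_size}, "
              f"size={vectors.vectors.nbytes / 1e6:.0f} MB")

Pairing such size measurements with task scores is what makes the diminishing-returns trade-off described in the abstract visible.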