Mateusz Gniewkowski
2021
Text Document Clustering: Wordnet vs. TF-IDF vs. Word Embeddings
Michał Marcińczuk | Mateusz Gniewkowski | Tomasz Walkowiak | Marcin Będkowski
Proceedings of the 11th Global Wordnet Conference
In this paper, we address the problem of unsupervised clustering of text documents in Polish. Our goal is to compare modern approaches based on language modeling (doc2vec and BERT) with classical ones, i.e., TF-IDF and wordnet-based methods. The experiments are conducted on three datasets containing qualification descriptions. The results show that wordnet-based similarity measures can compete with, and even outperform, modern embedding-based approaches.
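For readers unfamiliar with the classical TF-IDF baseline mentioned in the abstract, the following is a minimal sketch of TF-IDF weighting with cosine similarity between documents. It is not the authors' implementation: the function names, the unsmoothed `log(N / df)` weighting, the whitespace tokenisation, and the toy English documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Relative term frequency times unsmoothed inverse document frequency.
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy documents standing in for qualification descriptions.
docs = [
    "welding of steel structures".split(),
    "welding of aluminium structures".split(),
    "teaching mathematics in primary school".split(),
]
vectors = tfidf_vectors(docs)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
```

In this toy setup the two welding descriptions share weighted terms and so score higher than the unrelated pair, which shares no terms at all. A clustering algorithm would then operate on such pairwise similarities.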
2019
Evaluation of vector embedding models in clustering of text documents
Tomasz Walkowiak | Mateusz Gniewkowski
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
The paper presents an evaluation of word embedding models for clustering texts in Polish. The authors evaluated six embedding models, ranging from the widely used word2vec, through fastText with character n-gram embeddings, to the deep learning-based ELMo and BERT. In addition, four standardisation methods, three distance measures and four clustering methods were evaluated. The analysis was performed on two corpora of Polish texts classified by subject. The Adjusted Mutual Information (AMI) metric was used to assess the quality of the clustering results. The experiments show that Skipgram models with character n-gram embeddings, built on the KGR10 corpus and provided by Clarin-PL, outperform other publicly available models for Polish. Moreover, the results suggest that the Yeo–Johnson transformation for standardising document vectors and Agglomerative Clustering with cosine distance should be used for grouping text documents.
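The clustering step recommended in the abstract, agglomerative clustering with cosine distance, can be sketched in plain Python as follows. This is an illustrative sketch only, not the authors' code: the average-linkage criterion, the function names, and the toy three-dimensional vectors are assumptions, and the Yeo–Johnson standardisation and AMI evaluation steps are left out. The sketch also assumes that no document vector is all zeros.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def agglomerative(vectors, n_clusters):
    """Average-linkage agglomerative clustering; returns one label per vector."""
    # Start with every document in its own cluster.
    clusters = [[i] for i in range(len(vectors))]
    # Precompute all pairwise cosine distances once.
    dist = [[cosine_distance(u, v) for v in vectors] for u in vectors]

    def linkage(c1, c2):
        # Average distance over all cross-cluster pairs (assumed criterion).
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average distance.
        _, p, q = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        # Merge the second cluster into the first and drop it.
        clusters[p].extend(clusters[q])
        del clusters[q]

    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for index in members:
            labels[index] = label
    return labels


# Hypothetical document vectors: two pointing one way, two another.
doc_vectors = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
]
labels = agglomerative(doc_vectors, n_clusters=2)
```

With these toy vectors, the first two documents end up in one cluster and the last two in another. In practice a library implementation would be preferable to this quadratic-memory sketch for corpora of realistic size.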