Evaluation of vector embedding models in clustering of text documents

Tomasz Walkowiak, Mateusz Gniewkowski


Abstract
The paper presents an evaluation of word embedding models in the clustering of texts in the Polish language. The authors verified six different embedding models, ranging from the widely used word2vec, through fastText with character n-gram embeddings, to the deep learning-based ELMo and BERT. Moreover, four standardisation methods, three distance measures and four clustering methods were evaluated. The analysis was performed on two corpora of texts in Polish classified into subjects. The Adjusted Mutual Information (AMI) metric was used to verify the quality of the clustering results. The experiments show that Skipgram models with character n-gram embeddings, built on the KGR10 corpus and provided by Clarin-PL, outperform other publicly available models for Polish. Moreover, the presented results suggest that the Yeo–Johnson transformation for document vector standardisation and Agglomerative Clustering with a cosine distance should be used for grouping text documents.
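The pipeline recommended in the abstract (Yeo–Johnson standardisation of document vectors, agglomerative clustering with a cosine distance, AMI for scoring) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the document vectors are synthetic stand-ins for real embedding-based vectors, the helper names `make_toy_document_vectors` and `cluster_documents` are hypothetical, average linkage is an assumed choice that the abstract does not specify, and scikit-learn and SciPy are assumed to be installed.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import adjusted_mutual_info_score
from sklearn.preprocessing import PowerTransformer


def make_toy_document_vectors(n_topics=3, docs_per_topic=30, dim=20, seed=0):
    """Synthetic stand-in for document vectors (NOT data from the paper)."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(n_topics, dim))            # one centre per subject
    labels = np.repeat(np.arange(n_topics), docs_per_topic)
    vectors = centers[labels] + 0.2 * rng.normal(size=(labels.size, dim))
    return vectors, labels


def cluster_documents(vectors, n_clusters):
    """Yeo-Johnson standardisation, then agglomerative clustering (cosine)."""
    # Per-feature Yeo-Johnson power transform, followed by zero-mean,
    # unit-variance scaling (standardize=True).
    standardised = PowerTransformer(
        method="yeo-johnson", standardize=True
    ).fit_transform(vectors)
    # Bottom-up (agglomerative) merging on cosine distance.
    # Average linkage is an assumption made for this sketch.
    tree = linkage(standardised, method="average", metric="cosine")
    # Cut the merge tree so that at most n_clusters flat clusters remain.
    return fcluster(tree, t=n_clusters, criterion="maxclust")


vectors, true_labels = make_toy_document_vectors()
predicted = cluster_documents(vectors, n_clusters=3)

# AMI compares the predicted grouping with the known subject labels,
# corrected for chance agreement.
ami = adjusted_mutual_info_score(true_labels, predicted)
print(f"AMI on toy data: {ami:.3f}")
```

With real data, `vectors` would be replaced by document vectors derived from one of the evaluated embedding models, and `true_labels` by the subject labels of the corpus.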
Anthology ID:
R19-1149
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Month:
September
Year:
2019
Address:
Varna, Bulgaria
Editors:
Ruslan Mitkov, Galia Angelova
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
1304–1311
URL:
https://aclanthology.org/R19-1149/
DOI:
10.26615/978-954-452-056-4_149
Cite (ACL):
Tomasz Walkowiak and Mateusz Gniewkowski. 2019. Evaluation of vector embedding models in clustering of text documents. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 1304–1311, Varna, Bulgaria. INCOMA Ltd.
Cite (Informal):
Evaluation of vector embedding models in clustering of text documents (Walkowiak & Gniewkowski, RANLP 2019)
PDF:
https://aclanthology.org/R19-1149.pdf