HS-GC: Holistic Semantic Embedding and Global Contrast for Effective Text Clustering

Chen Yang; Bin Cao; Jing Fan

Abstract

In this paper, we introduce Holistic Semantic Embedding and Global Contrast (HS-GC), an end-to-end approach for learning instance- and cluster-level representations. Specifically, for instance-level representation learning, we introduce a new loss function that exploits different layers of semantic information in a deep neural network to provide a more holistic semantic text representation. Contrastive learning is applied to these representations to improve the model’s ability to represent text instances. Additionally, for cluster-level representation learning, we propose two strategies that use global updates to construct cluster centers from a global view. Extensive experimental evaluation on five text datasets shows that our method outperforms the state-of-the-art model. In particular, on the SearchSnippets dataset, our method outperforms the latest comparison method by 4.4% in normalized mutual information. On the StackOverflow and TREC datasets, our method improves clustering accuracy by 5.9% and 3.2%, respectively.

Anthology ID: 2024.lrec-main.732
Volume: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month: May
Year: 2024
Address: Torino, Italia
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues: LREC | COLING
Publisher: ELRA and ICCL
Pages: 8349–8359
URL: https://aclanthology.org/2024.lrec-main.732
Cite (ACL): Chen Yang, Bin Cao, and Jing Fan. 2024. HS-GC: Holistic Semantic Embedding and Global Contrast for Effective Text Clustering. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 8349–8359, Torino, Italia. ELRA and ICCL.
Cite (Informal): HS-GC: Holistic Semantic Embedding and Global Contrast for Effective Text Clustering (Yang et al., LREC-COLING 2024)
PDF: https://aclanthology.org/2024.lrec-main.732.pdf
