Learn The Big Picture: Representation Learning for Clustering

Sumanta Kashyapi, Laura Dietz


Abstract
Existing supervised models for text clustering struggle to directly optimize for clustering results. This is because clustering is a discrete process, and it is difficult to estimate a meaningful gradient of a discrete function that could drive gradient-based optimization algorithms. Consequently, existing supervised clustering algorithms indirectly optimize a continuous function that approximates the clustering process. We propose a scalable training strategy that directly optimizes for a discrete clustering metric. We train a BERT-based embedding model using our method and evaluate it on two publicly available datasets. We show that our method outperforms another BERT-based embedding model employing triplet loss as well as other unsupervised baselines. This suggests that directly optimizing for the clustering outcome indeed yields representations better suited for clustering.
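For illustration, below is a minimal, hypothetical sketch of one way to backpropagate through a discrete clustering step: treat the clustering-plus-metric computation as a black box and estimate a gradient by perturbing the embeddings and re-running it, in the spirit of blackbox differentiation of combinatorial solvers. All names (`cluster_and_score`, `BlackboxClusterLoss`, the `lam` perturbation scale, the choice of k-means and Adjusted Rand Index) are assumptions for illustration; this is not necessarily the authors' exact formulation.

```python
# Hypothetical sketch, NOT the paper's exact method: a perturbation-based
# surrogate gradient for a discrete clustering metric (here, 1 - ARI).
import torch
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score


def cluster_and_score(embeddings, gold_labels, k):
    """Discrete, non-differentiable step: hard-cluster the embeddings
    with k-means and score the result against gold labels with ARI."""
    preds = KMeans(n_clusters=k, n_init=10).fit_predict(
        embeddings.detach().cpu().numpy())
    return adjusted_rand_score(gold_labels, preds)


class BlackboxClusterLoss(torch.autograd.Function):
    """Loss = 1 - ARI of a hard clustering. The backward pass estimates
    a gradient by finite differences over a random perturbation of the
    embeddings (a crude single-sample estimate, for illustration only)."""

    @staticmethod
    def forward(ctx, embeddings, gold_labels, k, lam):
        ctx.save_for_backward(embeddings)
        ctx.gold_labels, ctx.k, ctx.lam = gold_labels, k, lam
        score = cluster_and_score(embeddings, gold_labels, k)
        return embeddings.new_tensor(1.0 - score)

    @staticmethod
    def backward(ctx, grad_output):
        (emb,) = ctx.saved_tensors
        noise = torch.randn_like(emb)
        base = cluster_and_score(emb, ctx.gold_labels, ctx.k)
        pert = cluster_and_score(emb + ctx.lam * noise, ctx.gold_labels, ctx.k)
        # Directional finite difference of the loss along the noise:
        # if the perturbation improved ARI, push the embeddings that way.
        grad_emb = -((pert - base) / ctx.lam) * noise
        return grad_output * grad_emb, None, None, None


# Usage sketch: bert_embeddings would be the (requires_grad) output of a
# BERT encoder for a batch of texts; gold_labels their gold cluster ids.
# loss = BlackboxClusterLoss.apply(bert_embeddings, gold_labels, num_clusters, 0.1)
# loss.backward()  # propagates the estimated gradient into the encoder
```

The design point this sketch makes concrete is the one the abstract argues: the clustering step and the metric stay fully discrete at training time, and only the gradient is approximated, rather than approximating the clustering process itself with a continuous surrogate.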
Anthology ID:
2021.repl4nlp-1.15
Volume:
Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)
Month:
August
Year:
2021
Address:
Online
Editors:
Anna Rogers, Iacer Calixto, Ivan Vulić, Naomi Saphra, Nora Kassner, Oana-Maria Camburu, Trapit Bansal, Vered Shwartz
Venue:
RepL4NLP
Publisher:
Association for Computational Linguistics
Pages:
141–151
URL:
https://aclanthology.org/2021.repl4nlp-1.15
DOI:
10.18653/v1/2021.repl4nlp-1.15
Cite (ACL):
Sumanta Kashyapi and Laura Dietz. 2021. Learn The Big Picture: Representation Learning for Clustering. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pages 141–151, Online. Association for Computational Linguistics.
Cite (Informal):
Learn The Big Picture: Representation Learning for Clustering (Kashyapi & Dietz, RepL4NLP 2021)
PDF:
https://aclanthology.org/2021.repl4nlp-1.15.pdf
Code:
nihilistsumo/blackbox_clustering