Controllable Clustering with LLM-driven Embeddings

Kerria Pang-Naylor; Shivani Manivasagan; Aitong Zhong; Mehak Garg; Nicholas Mondello; Blake Buckner; Jonathan P. Chang; Khyati Mahajan; Masoud Hashemi; Fabio Casati

doi:10.18653/v1/2025.emnlp-industry.48

Abstract

Given the inherent subjectivity of similarity in text, fully unsupervised text clustering is unlikely to produce groupings that work across a variety of use cases. Traditional techniques to guide clustering rely on costly, time-consuming human feedback and/or pre-existing labels. Leveraging recent advancements in LLMs and decoder-only embedding models, we present techniques to effectively control text embeddings with minimal human input: prefix instructions and LLM preprocessing. We evaluate clustering performance for datasets with multiple independent ground-truth labels, or perspectives, and find that these techniques can improve clustering for one perspective or use case, at the cost of reduced performance for another.
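
The multi-perspective evaluation described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy labels, the hypothetical cluster assignments, and the choice of cluster purity as the score (the abstract does not name the metric the paper uses). The sketch only shows how one clustering can score well against one set of ground-truth labels and worse against another.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, labels):
    """Fraction of items whose label equals the majority label of their cluster."""
    by_cluster = defaultdict(list)
    for cluster_id, label in zip(cluster_ids, labels):
        by_cluster[cluster_id].append(label)
    # For each cluster, count the items carrying its most common label.
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in by_cluster.values()
    )
    return majority_total / len(labels)


# Hypothetical toy data: six texts, each with two independent ground-truth labels.
topic = ["billing", "billing", "billing", "login", "login", "login"]
sentiment = ["pos", "neg", "pos", "neg", "pos", "neg"]

# Hypothetical cluster assignments, standing in for the output of clustering
# embeddings that were steered toward one perspective or the other.
clusters_topic_steered = [0, 0, 0, 1, 1, 1]
clusters_sentiment_steered = [0, 1, 0, 1, 0, 1]

# Score each clustering against both perspectives.
scores = {
    "topic_steered": {
        "topic": purity(clusters_topic_steered, topic),
        "sentiment": purity(clusters_topic_steered, sentiment),
    },
    "sentiment_steered": {
        "topic": purity(clusters_sentiment_steered, topic),
        "sentiment": purity(clusters_sentiment_steered, sentiment),
    },
}

for name, per_perspective in scores.items():
    print(name, per_perspective)
```

In this toy setup each clustering is perfect for the perspective it targets and weaker for the other, which mirrors the tradeoff the abstract reports. Purity is used only because it needs no external library; a chance-corrected metric would be a reasonable substitute.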

Anthology ID: 2025.emnlp-industry.48
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month: November
Year: 2025
Address: Suzhou (China)
Editors: Saloni Potdar, Lina Rojas-Barahona, Sebastien Montella
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 686–702
URL: https://aclanthology.org/2025.emnlp-industry.48/
DOI: 10.18653/v1/2025.emnlp-industry.48
Cite (ACL): Kerria Pang-Naylor, Shivani Manivasagan, Aitong Zhong, Mehak Garg, Nicholas Mondello, Blake Buckner, Jonathan P. Chang, Khyati Mahajan, Masoud Hashemi, and Fabio Casati. 2025. Controllable Clustering with LLM-driven Embeddings. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 686–702, Suzhou (China). Association for Computational Linguistics.
Cite (Informal): Controllable Clustering with LLM-driven Embeddings (Pang-Naylor et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-industry.48.pdf
