MIST: Mutual Information Maximization for Short Text Clustering

Krissanee Kamthawee, Can Udomcharoenchaikit, Sarana Nutanong


Abstract
Short text clustering poses substantial challenges due to the limited amount of information provided by each text sample. Previous efforts based on dense representations are still inadequate because texts are not sufficiently segregated in the embedding space before clustering. Even though the state-of-the-art method utilizes contrastive learning to boost performance, the process of summarizing all local tokens to form a sequence representation for the whole text includes noise that may obscure limited key information. We propose the Mutual Information Maximization Framework for Short Text Clustering (MIST), which overcomes the information drown-out by including a mechanism to maximize the mutual information between representations on both sequence and token levels. Experimental results across eight standard short text datasets show that MIST outperforms the state-of-the-art method in terms of Accuracy or Normalized Mutual Information in most cases.
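The abstract reports results in terms of Accuracy and Normalized Mutual Information (NMI). As a rough illustration of what these two clustering metrics measure, the sketch below computes both from ground-truth classes and predicted cluster ids. It is not the authors' evaluation code: the function names are hypothetical, NMI is normalized by the arithmetic mean of the two entropies (an assumption, since the abstract does not state the normalization), and the best cluster-to-class mapping is found by brute force rather than by the Hungarian algorithm, so it is only practical for a handful of clusters.

```python
from collections import Counter
from itertools import permutations
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(true_labels, pred_labels):
    """Mutual information (in nats) between two label assignments."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)
    mi = 0.0
    for (t, p), c in joint.items():
        # p(t, p) * log(p(t, p) / (p(t) * p(p))), written with raw counts
        mi += (c / n) * log(c * n / (true_counts[t] * pred_counts[p]))
    return mi


def nmi(true_labels, pred_labels):
    """NMI, normalized by the mean of the two entropies (assumed convention)."""
    denom = (entropy(true_labels) + entropy(pred_labels)) / 2
    if denom == 0:
        # Both assignments put every sample in a single group.
        return 1.0
    return mutual_information(true_labels, pred_labels) / denom


def clustering_accuracy(true_labels, pred_labels):
    """Accuracy under the best one-to-one mapping of cluster ids to classes.

    Brute force over permutations; assumes sortable labels and few clusters.
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(pred_labels))
    # Pad with None so every cluster gets a target even if clusters > classes.
    classes += [None] * max(0, len(clusters) - len(classes))
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        correct = sum(mapping[p] == t for t, p in zip(true_labels, pred_labels))
        best = max(best, correct)
    return best / len(true_labels)


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]
    predicted = [1, 1, 0, 0, 2, 2]  # same grouping, different cluster ids
    print(clustering_accuracy(truth, predicted), nmi(truth, predicted))
```

Both metrics ignore the arbitrary numbering of clusters: a prediction that recovers the true grouping under different ids scores 1.0 on accuracy and 1.0 on NMI up to floating-point rounding, while a prediction that places every sample in one cluster gets an NMI of 0.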
Anthology ID:
2024.acl-long.610
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
11309–11324
URL:
https://aclanthology.org/2024.acl-long.610
Cite (ACL):
Krissanee Kamthawee, Can Udomcharoenchaikit, and Sarana Nutanong. 2024. MIST: Mutual Information Maximization for Short Text Clustering. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11309–11324, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
MIST: Mutual Information Maximization for Short Text Clustering (Kamthawee et al., ACL 2024)
PDF:
https://aclanthology.org/2024.acl-long.610.pdf