Unsupervised Multimodal Clustering for Semantics Discovery in Multimodal Utterances

Hanlei Zhang, Hua Xu, Fei Long, Xin Wang, Kai Gao


Abstract
Discovering the semantics of multimodal utterances is essential for understanding human language and enhancing human-machine interactions. Existing methods are limited in their ability to leverage nonverbal information for discerning complex semantics in unsupervised scenarios. This paper introduces a novel unsupervised multimodal clustering method (UMC), making a pioneering contribution to this field. UMC constructs augmentation views for multimodal data and uses them in pre-training to establish well-initialized representations for subsequent clustering. It then dynamically selects high-quality samples as guidance for representation learning, with quality gauged by the density of each sample's nearest neighbors, and automatically determines the optimal value of the top-K parameter in each cluster to refine sample selection. Finally, both high- and low-quality samples are used to learn representations conducive to effective clustering. We build baselines on benchmark multimodal intent and dialogue act datasets. UMC improves clustering metrics by 2-6% over state-of-the-art methods, marking the first successful endeavor in this domain. The complete code and data are available at https://github.com/thuiar/UMC.
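The density-based sample selection mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes that a sample's density is the inverse of its mean distance to its k nearest neighbors within the same cluster, and that a fixed fraction of the densest samples per cluster is kept. The function names `knn_density` and `select_high_quality` and the parameters `k` and `ratio` are hypothetical, and the automatic search for the best top-K value described in the abstract is left out.

```python
import math


def knn_density(points, k):
    """Assumed density: inverse of the mean distance to the k nearest neighbors."""
    densities = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        mean_dist = sum(nearest) / len(nearest)
        densities.append(1.0 / (mean_dist + 1e-12))  # epsilon avoids division by zero
    return densities


def select_high_quality(points, labels, k, ratio):
    """Per cluster, keep the `ratio` fraction of members with the highest density."""
    selected = []
    for cluster in sorted(set(labels)):
        idx = [i for i, label in enumerate(labels) if label == cluster]
        if len(idx) < 2:
            # A singleton cluster has no neighbors to rank against; keep it as is.
            selected.extend(idx)
            continue
        k_eff = min(k, len(idx) - 1)  # cannot have more neighbors than other members
        dens = knn_density([points[i] for i in idx], k_eff)
        n_keep = max(1, int(round(ratio * len(idx))))
        ranked = sorted(zip(idx, dens), key=lambda pair: -pair[1])
        selected.extend(i for i, _ in ranked[:n_keep])
    return sorted(selected)
```

In this sketch, samples that sit far from the rest of their cluster receive a low density and are excluded from the high-quality set. The abstract indicates that such low-quality samples still take part in representation learning, so they would be handled separately rather than discarded.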
Anthology ID:
2024.acl-long.2
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
18–35
URL:
https://aclanthology.org/2024.acl-long.2
DOI:
10.18653/v1/2024.acl-long.2
Cite (ACL):
Hanlei Zhang, Hua Xu, Fei Long, Xin Wang, and Kai Gao. 2024. Unsupervised Multimodal Clustering for Semantics Discovery in Multimodal Utterances. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18–35, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Unsupervised Multimodal Clustering for Semantics Discovery in Multimodal Utterances (Zhang et al., ACL 2024)
PDF:
https://aclanthology.org/2024.acl-long.2.pdf