Document Hashing with Multi-Grained Prototype-Induced Hierarchical Generative Model

Qian Zhang; Qinliang Su; Jiayang Chen; Zhenpeng Song

doi:10.18653/v1/2024.findings-emnlp.18

Abstract

Document hashing plays a crucial role in large-scale information retrieval. However, existing unsupervised document hashing methods consider only the flat semantics of documents and therefore fail to preserve hierarchical semantics in hash codes. In this paper, we propose a hierarchical generative model that can model and leverage the hierarchical structure of semantics. Specifically, we introduce hierarchical prototypes into the model to construct a hierarchical prior distribution, which is integrated into the variational auto-encoder (VAE) framework, enabling the model to produce hash codes that preserve rough hierarchical semantics. To further promote the preservation of hierarchical structure, we force the hash code to preserve as much semantic information as possible via contrastive learning, which exploits the hierarchical pseudo labels produced during VAE training. In extensive experiments on three benchmarks, the proposed model outperforms all baseline methods, demonstrating its superiority on both hierarchical and flat datasets.
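
For readers unfamiliar with the retrieval setting the abstract refers to, the following is a minimal sketch of how binary hash codes are typically used: documents are ranked by Hamming distance to the query code. It is not the authors' model; the function names (`pack_bits`, `hamming_distance`, `retrieve`) and the toy 8-bit codes are assumptions made purely for illustration, and in practice the codes would come from a trained encoder.

```python
from typing import List, Sequence


def pack_bits(bits: Sequence[int]) -> int:
    """Pack a sequence of 0/1 values into one integer (first bit is most significant)."""
    code = 0
    for bit in bits:
        code = (code << 1) | (1 if bit else 0)
    return code


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two packed codes differ."""
    return bin(a ^ b).count("1")


def retrieve(query: int, database: Sequence[int], top_k: int) -> List[int]:
    """Return indices of the top_k database codes closest to the query.

    Ties are broken by the smaller index so the ranking is deterministic.
    """
    order = sorted(
        range(len(database)),
        key=lambda i: (hamming_distance(query, database[i]), i),
    )
    return order[:top_k]


# Hypothetical 8-bit codes standing in for the output of a hashing model.
database = [
    pack_bits([1, 1, 0, 0, 1, 0, 1, 0]),  # document 0
    pack_bits([1, 1, 0, 0, 0, 1, 0, 0]),  # document 1
    pack_bits([0, 0, 1, 1, 0, 1, 0, 1]),  # document 2
]
query = pack_bits([1, 1, 0, 0, 1, 0, 1, 1])

ranking = retrieve(query, database, top_k=3)
distances = [hamming_distance(query, database[i]) for i in ranking]
```

Packing codes into integers keeps the distance computation to a single XOR and a bit count, which is the main reason hashing is attractive for large-scale retrieval.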

Anthology ID: 2024.findings-emnlp.18
Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 321–333
URL: https://aclanthology.org/2024.findings-emnlp.18/
DOI: 10.18653/v1/2024.findings-emnlp.18
Cite (ACL): Qian Zhang, Qinliang Su, Jiayang Chen, and Zhenpeng Song. 2024. Document Hashing with Multi-Grained Prototype-Induced Hierarchical Generative Model. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 321–333, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Document Hashing with Multi-Grained Prototype-Induced Hierarchical Generative Model (Zhang et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-emnlp.18.pdf
Software: 2024.findings-emnlp.18.software.zip
Data: 2024.findings-emnlp.18.data.zip