Refining BERT Embeddings for Document Hashing via Mutual Information Maximization

Zijing Ou, Qinliang Su, Jianxing Yu, Ruihui Zhao, Yefeng Zheng, Bang Liu


Abstract
Existing unsupervised document hashing methods are mostly established on generative models. Due to the difficulties of capturing long dependency structures, these methods rarely model the raw documents directly, but instead model the features extracted from them (e.g., bag-of-words (BOW), TF-IDF). In this paper, we propose to learn hash codes from BERT embeddings after observing their tremendous successes on downstream tasks. As a first attempt, we modify existing generative hashing models to accommodate the BERT embeddings. However, little improvement is observed over the codes learned from the old BOW or TF-IDF features. We attribute this to the reconstruction requirement of generative hashing, which forces the irrelevant information that is abundant in BERT embeddings to also be compressed into the codes. To remedy this issue, a new unsupervised hashing paradigm is further proposed based on the mutual information (MI) maximization principle. Specifically, the method first constructs appropriate global and local codes from the documents and then seeks to maximize their mutual information. Experimental results on three benchmark datasets demonstrate that the proposed method is able to generate hash codes that outperform existing ones learned from BOW features by a substantial margin.
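To make the global/local MI-maximization idea concrete, below is a minimal PyTorch sketch. It is not the authors' implementation (see the linked j-zin/dhim repository for that); the encoder shapes, the straight-through binarization, and the InfoNCE-style MI lower bound with in-batch negatives are illustrative assumptions.

```python
# Minimal sketch: maximize MI between per-token "local" codes and a
# document-level "global" hash code derived from BERT token embeddings.
# NOT the authors' implementation; all module and function names here
# (Hasher, infonce_mi) are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Hasher(nn.Module):
    def __init__(self, dim=768, code_bits=64):
        super().__init__()
        self.local_proj = nn.Linear(dim, code_bits)   # per-token local codes
        self.global_proj = nn.Linear(dim, code_bits)  # document-level global code

    def forward(self, token_embs):  # token_embs: (B, T, dim) BERT outputs
        local = torch.tanh(self.local_proj(token_embs))              # (B, T, bits)
        g = torch.tanh(self.global_proj(token_embs.mean(dim=1)))     # (B, bits)
        # Straight-through binarization: hard sign in the forward pass,
        # identity gradient in the backward pass.
        code = torch.sign(g).detach() + g - g.detach()
        return local, code

def infonce_mi(local, global_code):
    """InfoNCE-style lower bound on MI between local and global codes.

    Each (document, token) pair is scored against every global code in the
    batch; the matching document is the positive, the rest are negatives.
    Minimizing this cross-entropy maximizes the MI lower bound.
    """
    B, T, _ = local.shape
    scores = torch.einsum('btd,cd->btc', local, global_code)  # (B, T, B)
    labels = torch.arange(B, device=local.device).repeat_interleave(T)
    return F.cross_entropy(scores.reshape(B * T, B), labels)

model = Hasher()
token_embs = torch.randn(8, 32, 768)  # stand-in for BERT token embeddings
local, code = model(token_embs)
loss = infonce_mi(local, code)
loss.backward()
```

Because the objective needs only the codes themselves (no document reconstruction), irrelevant detail in the BERT embeddings is not forced into the hash code, which is the paper's stated motivation for moving away from generative hashing.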
Anthology ID:
2021.findings-emnlp.203
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2021
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
Findings
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
2360–2369
URL:
https://aclanthology.org/2021.findings-emnlp.203
DOI:
10.18653/v1/2021.findings-emnlp.203
Cite (ACL):
Zijing Ou, Qinliang Su, Jianxing Yu, Ruihui Zhao, Yefeng Zheng, and Bang Liu. 2021. Refining BERT Embeddings for Document Hashing via Mutual Information Maximization. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2360–2369, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Refining BERT Embeddings for Document Hashing via Mutual Information Maximization (Ou et al., Findings 2021)
PDF:
https://aclanthology.org/2021.findings-emnlp.203.pdf
Video:
https://aclanthology.org/2021.findings-emnlp.203.mp4
Code:
j-zin/dhim
Data:
AG News