BCTH: A Novel Text Hashing Approach via Bayesian Clustering

Ying Wenjie, Yuquan Le, Hantao Xiong


Abstract
Similarity search finds the items most similar to a given target item. Performing similarity search at large scale plays a significant role in many information retrieval applications and has therefore received much attention. Text hashing is a promising strategy that represents documents as binary codes and achieves attractive performance. This paper makes the first attempt to use Bayesian clustering for text hashing, in an approach dubbed BCTH. Specifically, BCTH maps documents to binary codes by running multiple Bayesian clusterings in parallel, where each clustering is responsible for one bit. The approach employs a bit-balanced constraint to maximize the amount of information in each bit, and a bit-uncorrelated constraint to keep the bits independent of one another. The time complexity of BCTH is linear, and the hash codes and hash function are learned jointly. Experimental results on four widely used datasets demonstrate that BCTH is competitive with strong baselines in both precision and training speed.
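To make the terms in the abstract concrete, the following is a minimal Python sketch of what binary hash codes are used for and how the two constraints can be checked on a set of codes. It is not the BCTH training procedure, which the abstract does not specify in detail; the function names (`hamming`, `nearest`, `bit_balance`, `bit_correlation`) and the toy codes are assumptions made purely for illustration.

```python
from typing import List, Sequence

Code = Sequence[int]  # one hash code, each bit is 0 or 1 (illustrative type alias)


def hamming(a: Code, b: Code) -> int:
    """Number of bit positions in which two codes differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest(query: Code, codes: List[Code], k: int = 1) -> List[int]:
    """Indices of the k stored codes closest to the query in Hamming distance."""
    order = sorted(range(len(codes)), key=lambda i: hamming(query, codes[i]))
    return order[:k]


def bit_balance(codes: List[Code]) -> List[float]:
    """Fraction of ones per bit; 0.5 means the bit splits the corpus evenly."""
    n = len(codes)
    return [sum(c[j] for c in codes) / n for j in range(len(codes[0]))]


def bit_correlation(codes: List[Code], i: int, j: int) -> float:
    """Mean product of bits i and j after mapping {0, 1} to {-1, +1}.

    A value of 0 means the two bits are uncorrelated on this set of codes.
    """
    n = len(codes)
    return sum((2 * c[i] - 1) * (2 * c[j] - 1) for c in codes) / n


# Hypothetical 2-bit codes for four documents.
codes = [[0, 0], [0, 1], [1, 0], [1, 1]]
balance = bit_balance(codes)
correlation = bit_correlation(codes, 0, 1)
top = nearest([1, 1], codes, k=1)
```

On this toy set every bit is one for exactly half of the documents and the two bits have zero mean product, so both constraints hold exactly; real codes would satisfy them only approximately.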
Anthology ID:
2020.aacl-main.7
Volume:
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing
Month:
December
Year:
2020
Address:
Suzhou, China
Editors:
Kam-Fai Wong, Kevin Knight, Hua Wu
Venue:
AACL
Publisher:
Association for Computational Linguistics
Pages:
54–62
URL:
https://aclanthology.org/2020.aacl-main.7
Cite (ACL):
Ying Wenjie, Yuquan Le, and Hantao Xiong. 2020. BCTH: A Novel Text Hashing Approach via Bayesian Clustering. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 54–62, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
BCTH: A Novel Text Hashing Approach via Bayesian Clustering (Wenjie et al., AACL 2020)
PDF:
https://aclanthology.org/2020.aacl-main.7.pdf
Dataset:
 2020.aacl-main.7.Dataset.pdf
Code:
 myazi/semhash