Integrating Semantics and Neighborhood Information with Graph-Driven Generative Models for Document Retrieval

Owing to the need for fast retrieval speed and a small memory footprint, document hashing plays a crucial role in large-scale information retrieval. To generate high-quality hash codes, both semantic and neighborhood information are crucial. However, most existing methods leverage only one of them or simply combine the two via some intuitive criteria, lacking a theoretical principle to guide the integration process. In this paper, we encode the neighborhood information with a graph-induced Gaussian distribution, and propose to integrate the two types of information with a graph-driven generative model. To deal with the complicated correlations among documents, we further propose a tree-structured approximation method for learning. Under this approximation, we prove that the training objective can be decomposed into terms involving only singleton or pairwise documents, enabling the model to be trained as efficiently as uncorrelated ones. Extensive experimental results on three benchmark datasets show that our method achieves superior performance over state-of-the-art methods, demonstrating the effectiveness of the proposed model for simultaneously preserving semantic and neighborhood information.


Introduction
Similarity search plays a pivotal role in a variety of tasks, such as image retrieval (Jing and Baluja, 2008; Zhang et al., 2018), plagiarism detection (Stein et al., 2007) and recommendation systems (Koren, 2008). If the search is carried out directly in the original continuous feature space, the computation and storage requirements become extremely high, especially for large-scale applications. Semantic hashing (Salakhutdinov and Hinton, 2009b) sidesteps this problem by learning a compact binary code for every item such that similar items can be efficiently found according to the Hamming distance of their binary codes.
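To make the Hamming-distance retrieval concrete, the toy sketch below (our illustration, not from the paper) retrieves the nearest items to a query by counting differing bits over random binary codes; real systems pack bits into words and use XOR + popcount, but the semantics are the same.

```python
import numpy as np

# Toy illustration: nearest-neighbor retrieval by Hamming distance
# over binary codes (random here; learned by a hashing model in practice).
rng = np.random.default_rng(0)
db_codes = rng.integers(0, 2, size=(1000, 64), dtype=np.uint8)  # 1000 items, 64-bit codes
query = rng.integers(0, 2, size=64, dtype=np.uint8)

# Hamming distance = number of differing bits.
dists = np.count_nonzero(db_codes != query, axis=1)
top10 = np.argsort(dists)[:10]   # indices of the 10 closest items
print(top10, dists[top10])
```

Because each comparison is a fixed-width bit operation, the cost per item is constant and independent of the original feature dimensionality, which is the source of the speed and memory savings the paper refers to.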
Unsupervised semantic hashing aims to learn for each item a binary code that preserves the semantic similarity of the original items, without the supervision of any labels. Motivated by the success of deep generative models (Salakhutdinov and Hinton, 2009a; Kingma and Welling, 2013; Rezende et al., 2014) in unsupervised representation learning, many recent methods approach this problem from the perspective of deep generative models, leading to state-of-the-art performance on benchmark datasets. Specifically, these methods train a deep generative model of the underlying documents and then use the trained model to extract continuous or binary representations from the original documents (Chaidaroon and Fang, 2017; Shen et al., 2018; Dong et al., 2019; Zheng et al., 2020). The basic principle behind these generative hashing methods is to have the hash codes retain as much semantic information of the original documents as possible, so that semantically similar documents are more likely to yield similar codes.
In addition to semantic information, it is widely observed that neighborhood information among documents is also useful for generating high-quality hash codes. By constructing an adjacency matrix from the raw features of documents, neighbor-based methods seek to preserve the information in the constructed adjacency matrix, such as locality-preserving hashing (He et al., 2004; Zhao et al., 2014) and spectral hashing (Weiss et al., 2009; Li et al., 2012). However, since the ground-truth neighborhood information is not available and the constructed one is neither accurate nor complete, neighbor-based methods alone do not perform as well as semantics-based ones. Although both semantics and neighborhood information are derived from the original documents, they emphasize different aspects. Thus, to obtain higher-quality hash codes, it has been proposed to incorporate the constructed neighborhood information into semantics-based methods. For example, Chaidaroon et al. (2018) and Hansen et al. (2020) require that the hash codes reconstruct neighboring documents in addition to the original input. Other works (Shen et al., 2019; Hansen et al., 2019) use an extra loss term, derived from the approximate neighborhood information, to encourage similar documents to produce similar codes. However, all of the aforementioned methods exploit the neighborhood information by using it to design different kinds of regularizers for the original semantics-based models, lacking a basic principle to unify and leverage the two types of information under one framework.
To fully exploit the two types of information, in this paper, we propose a hashing method that unifies the semantics and neighborhood information with a graph-driven generative model. Specifically, we first encode the neighborhood information with a multivariate Gaussian distribution. With this Gaussian distribution as the prior of a generative model, the neighborhood information is naturally incorporated into the semantics-based hashing model. Despite the simplicity of the modeling, the correlation introduced by the neighbor-encoded prior poses a significant challenge for training, since it invalidates the widely used independent-and-identically-distributed (i.i.d.) assumption, making all documents correlated. To address this issue, we propose to use a tree-structured distribution to capture as much of the neighborhood information as possible. We prove that under the tree approximation, the evidence lower bound (ELBO) can be decomposed into terms involving only singleton and pairwise documents, enabling the model to be trained as efficiently as models that do not consider document correlations. To capture more neighborhood information, a more accurate approximation using multiple trees is also developed. Extensive experimental results on three public datasets demonstrate that the proposed method outperforms state-of-the-art methods, indicating the effectiveness of the proposed framework in unifying the semantic and neighborhood information for document hashing.

Preliminaries
Semantics-Based Hashing Due to the similarities among the underlying ideas of these methods, we take the variational deep semantic hashing (VDSH) model (Chaidaroon and Fang, 2017) as an example to illustrate their workflow. Given a document $x = \{w_j\}_{j=1}^{|x|}$, VDSH proposes to model the document with a generative model

$$p_\theta(x) = \int p(z)\, p_\theta(x|z)\, dz,$$

where $p(z)$ is the prior distribution and is chosen to be the standard Gaussian distribution $\mathcal{N}(z; 0, I_d)$, with $I_d$ denoting the $d$-dimensional identity matrix; and $p_\theta(x|z)$ is defined to be

$$p_\theta(x|z) = \prod_{j=1}^{|x|} \frac{\exp\!\big(z^\top E w_j\big)}{\sum_{k=1}^{|V|} \exp\!\big(z^\top E e_k\big)},$$

in which $w_j$ denotes the $|V|$-dimensional one-hot representation of the $j$-th word and $e_k$ that of the $k$-th vocabulary word, with $|x|$ and $|V|$ denoting the document and vocabulary size, respectively; and $E \in \mathbb{R}^{d\times|V|}$ represents the learnable embedding matrix. For a corpus containing $N$ documents $X = \{x_1, x_2, \cdots, x_N\}$, due to the i.i.d. assumption for documents, the corpus is modelled by simply multiplying the individual document models:

$$p_\theta(X) = \prod_{i=1}^{N} p_\theta(x_i).$$

Throughout, $Z \triangleq [z_1; z_2; \cdots; z_N]$ denotes the long vector obtained by concatenating the individual vectors $z_i$. The model is trained by optimizing the evidence lower bound (ELBO) of the log-likelihood function $\log p(X)$. After training, outputs from the trained encoder are used as the documents' representations, from which binary hash codes are obtained by thresholding the real-valued representations.
Neighborhood Information The ground-truth semantic similarity information is not available in the unsupervised hashing task. To leverage this kind of information anyway, an $N \times N$ affinity matrix $A$ is generally constructed from the raw features (e.g., the TFIDF vectors) of the original documents. For instance, we can construct the matrix as

$$a_{ij} = \begin{cases} \cos(x_i, x_j), & \text{if } x_j \in \mathcal{N}_k(x_i) \text{ or } x_i \in \mathcal{N}_k(x_j),\\ 0, & \text{otherwise}, \end{cases}$$

where $a_{ij}$ denotes the $(i, j)$-th element of $A$, and $\mathcal{N}_k(x)$ denotes the $k$-nearest neighbors of document $x$. Given the affinity matrix $A$, some methods have been proposed to incorporate the neighborhood information into semantics-based hashing models. However, as discussed above, these methods generally leverage the information based on some intuitive criteria, lacking theoretical support.
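A minimal sketch of such a k-NN affinity construction is given below. It assumes cosine similarity over TFIDF-like features (consistent with the experimental setup described later); the exact weighting used by the authors may differ, so treat this as an illustration rather than their implementation.

```python
import numpy as np

def knn_affinity(X, k):
    """Build a symmetric k-NN affinity matrix from feature rows X (N x V)
    using cosine similarity. A sketch of the construction in the text."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T                    # pairwise cosine similarities
    np.fill_diagonal(S, -np.inf)     # exclude self from the neighbor sets
    A = np.zeros_like(S)
    for i in range(len(S)):
        nbrs = np.argsort(S[i])[-k:]           # k nearest neighbors of doc i
        A[i, nbrs] = S[i, nbrs]                # keep their similarity values
    A = np.maximum(A, A.T)           # symmetrize: edge if either side picked it
    return A

X = np.abs(np.random.default_rng(1).normal(size=(6, 10)))  # stand-in TFIDF rows
A = knn_affinity(X, k=2)
print(A.shape)
```

The symmetrization step matches the "x_j in N_k(x_i) or x_i in N_k(x_j)" convention: an edge survives if either document lists the other among its k nearest neighbors.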

A Hashing Framework with Unified Semantics-Neighborhood Information
In this section, we present a more effective framework to unify the semantic and neighborhood information for the task of document hashing.

Reformulating the VDSH
To introduce the neighborhood information into semantics-based hashing models, we first rewrite the VDSH model in the compact form

$$p_\theta(X) = \int p_I(Z)\, p_\theta(X|Z)\, dZ,$$

where $p_\theta(X|Z) = \prod_{k=1}^{N} p_\theta(x_k|z_k)$; and the prior $p_I(Z) = \prod_{k=1}^{N} p(z_k)$, which can be shown to be

$$p_I(Z) = \mathcal{N}\big(Z;\, 0,\, I_N \otimes I_d\big).$$

Here, $\otimes$ denotes the Kronecker product and the subscript $I$ indicates independence among the $z_k$. The ELBO of this model can be expressed as

$$\mathcal{L} = \underbrace{\mathbb{E}_{q_\phi(Z|X)}\big[\log p_\theta(X|Z)\big]}_{\mathcal{L}_1} - \underbrace{KL\big(q_\phi(Z|X)\,\|\,p_I(Z)\big)}_{\mathcal{L}_2},$$

where $KL(\cdot)$ denotes the Kullback-Leibler (KL) divergence. By restricting the posterior to the independent Gaussian form

$$q_\phi(Z|X) = \prod_{i=1}^{N} q_\phi(z_i|x_i) = \prod_{i=1}^{N} \mathcal{N}\big(z_i;\, \mu_i,\, \mathrm{diag}(\sigma_i^2)\big),$$

the $\mathcal{L}_1$ term can be handled using the reparameterization trick. Thanks to the factorized forms assumed in $q_\phi(Z|X)$ and $p_I(Z)$, the $\mathcal{L}_2$ term can also be expressed analytically and evaluated efficiently.
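The two ELBO terms above can be sketched numerically. The snippet below (our notation, not the authors' code) shows the reparameterization trick for the L1 term and the closed-form KL between a diagonal Gaussian posterior and the standard Gaussian prior for the L2 term.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), so gradients
    can flow through mu and log_var (the reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_std_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
mu, log_var = np.zeros((4, 8)), np.zeros((4, 8))  # here q equals the prior
z = reparameterize(mu, log_var, rng)
print(kl_to_std_normal(mu, log_var))   # all zeros: KL vanishes when q = p
```

The sanity check at the end reflects the definition: the KL term is zero exactly when the posterior coincides with the prior, and grows as the posterior means and variances deviate from (0, I).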

Injecting the Neighborhood Information
Given an affinity matrix $A$, the covariance matrix $I_N + \lambda A$ can be used to encode the neighborhood information of documents, where the hyperparameter $\lambda \in [0, 1)$ controls the overall correlation strength. If two documents are neighbors, the corresponding correlation value in $I_N + \lambda A$ is large; otherwise, the value is zero.
To have the neighborhood information reflected in the document representations, we can require that the representations $z_i$ are drawn from a Gaussian distribution of the form

$$p_G(Z) = \mathcal{N}\big(Z;\, 0,\, (I_N + \lambda A) \otimes I_d\big),$$

where the subscript $G$ denotes that the distribution is constructed from a neighborhood graph. To see why representations $Z \sim p_G(Z)$ reflect the neighborhood information, consider an example with three documents in which $x_1$ is connected to $x_2$, $x_2$ is connected to $x_3$, and no connection exists between $x_1$ and $x_3$. In the case that each $z_i$ is a two-dimensional vector $z_i \in \mathbb{R}^2$, the concatenated representation $[z_1; z_2; z_3]$ follows a Gaussian distribution with covariance matrix

$$(I_3 + \lambda A) \otimes I_2 = \begin{bmatrix} 1 & 0 & \lambda & 0 & 0 & 0\\ 0 & 1 & 0 & \lambda & 0 & 0\\ \lambda & 0 & 1 & 0 & \lambda & 0\\ 0 & \lambda & 0 & 1 & 0 & \lambda\\ 0 & 0 & \lambda & 0 & 1 & 0\\ 0 & 0 & 0 & \lambda & 0 & 1 \end{bmatrix}.$$

From the properties of the Gaussian distribution, it can be seen that $z_1$ is strongly correlated with $z_2$ on the corresponding elements, but not with $z_3$. This suggests that $z_1$ should be similar to $z_2$ but different from $z_3$, which is consistent with the neighborhood relation that $x_1$ is a neighbor of $x_2$ but not of $x_3$. The neighborhood information can thus be modeled by requiring $Z$ to be drawn from $p_G(Z)$, while the semantic information is reflected in the likelihood function $p_\theta(X|Z)$. The two types of information can therefore be taken into account simultaneously by modeling the corpus as

$$p_\theta(X) = \int p_G(Z)\, p_\theta(X|Z)\, dZ.$$

Compared to the VDSH model in (6), the only difference lies in the employed prior: here a neighborhood-preserving prior $p_G(Z)$ is employed, while VDSH uses the independent prior $p_I(Z)$. Although only the prior is modified from the modeling perspective, significant challenges are posed for training. Specifically, by replacing $p_I(Z)$ with $p_G(Z)$ in the $\mathcal{L}_2$ term of $\mathcal{L}$, it can be shown that the expression of $\mathcal{L}_2$ involves the matrix $\big((I_N + \lambda A) \otimes I_d\big)^{-1}$.
Due to the introduced dependence among documents, for example, if the corpus contains over 100,000 documents and the representation dimension is set to 100, the L 2 involves the inverse of matrices with dimension as high as 10 7 , which is computationally prohibitive in practice.
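The three-document example above can be checked numerically: in the Kronecker-structured covariance, neighboring representations share a correlation of strength lambda dimension-by-dimension, while non-neighbors stay uncorrelated.

```python
import numpy as np

# Numeric check of the chain graph x1 - x2 - x3 with d = 2 dimensions:
# cov([z1; z2; z3]) = (I_3 + lam * A) kron I_2.
lam = 0.9
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
cov = np.kron(np.eye(3) + lam * A, np.eye(2))   # 6 x 6 covariance

print(cov[0, 2])   # cov(z1[0], z2[0]): neighbors -> lam
print(cov[0, 4])   # cov(z1[0], z3[0]): non-neighbors -> 0
```

This also makes the computational problem visible: for N documents and d dimensions the full covariance is (N·d) x (N·d), so evaluating its inverse directly is infeasible at the scales quoted in the text.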

Training with Tree Approximations
Although the prior p G (Z) captures the full neighborhood information, its induced model is not practically trainable. In this section, to facilitate the training, we first propose to use a tree-structured prior to partially capture the neighborhood information, and then extend it to multiple-tree case for more accurate modeling.

Approximating the Prior p G (Z) with a Tree-Structured Distribution
The matrix $A$ represents a graph $G = (V, E)$, where $V = \{1, 2, \cdots, N\}$ is the set of document indices and $E = \{(i, j) \mid a_{ij} \neq 0\}$ is the set of connections between documents. From the graph $G$, a spanning tree $T = (V, E_T)$ can be obtained easily, where $E_T$ denotes the set of connections on the tree. Based on the spanning tree, we construct a new distribution

$$p_T(Z) = \prod_{i \in V} p_G(z_i) \prod_{(i,j)\in E_T} \frac{p_G(z_i, z_j)}{p_G(z_i)\, p_G(z_j)},$$

where $p_G(z_i)$ and $p_G(z_i, z_j)$ represent the one- and two-variable marginal distributions of $p_G(Z)$, respectively. From the properties of the Gaussian distribution, these marginals are themselves Gaussian and available in closed form. Since $p_T(Z)$ is defined on a tree, as proved in (Wainwright and Jordan, 2008), it is guaranteed to be a valid probability distribution and, more importantly, it satisfies the following two relations:

$$p_T(z_i) = p_G(z_i) \ \ \forall i \in V, \qquad p_T(z_i, z_j) = p_G(z_i, z_j) \ \ \forall (i,j) \in E_T.$$

That is, the tree-structured distribution $p_T(Z)$ captures the neighborhood information reflected on the spanning tree $T$. By using $p_T(Z)$ to replace $p_I(Z)$ in $\mathcal{L}_2$, it can be shown that $\mathcal{L}_2$ can be expressed as a summation of terms involving only one or two variables, which can be handled easily. Due to the limitation of space, the concrete expression for the lower bound is given in the Supplementary Material.
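Extracting a spanning tree from the affinity graph is straightforward; the sketch below uses breadth-first search for simplicity (the paper's own TreeGen procedure, described in the appendix, uses a randomized depth-first search instead).

```python
import numpy as np
from collections import deque

def spanning_tree_edges(A):
    """Return the edge list of a spanning tree of the graph encoded by a
    symmetric affinity matrix A, found by breadth-first search.
    A BFS sketch; the paper's TreeGen(.) uses randomized DFS."""
    visited = {0}
    edges, queue = [], deque([0])
    while queue:
        i = queue.popleft()
        for j in np.flatnonzero(A[i]):     # neighbors of node i
            if j not in visited:
                visited.add(int(j))
                edges.append((i, int(j)))  # tree edge (parent, child)
                queue.append(int(j))
    return edges

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], float)
print(spanning_tree_edges(A))   # 3 edges spanning the 4-node graph
```

Any spanning tree keeps exactly N - 1 of the graph's edges, which is precisely why the tree-structured prior trades some neighborhood information for tractability.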

Imposing Correlations on the Posterior
The posterior distribution $q_\phi(Z|X)$ in the previous section is assumed to be in independent form, as shown in (8). But since a prior $p_T(Z)$ that considers the correlations among documents is used, assuming an independent posterior is not appropriate. Hence, we follow the tree-structured prior and also construct a tree-structured posterior

$$q_T(Z|X) = \prod_{i \in V} q_\phi(z_i|x_i) \prod_{(i,j)\in E_T} \frac{q_\phi(z_i, z_j|x_i, x_j)}{q_\phi(z_i|x_i)\, q_\phi(z_j|x_j)},$$

where $q_\phi(z_i|x_i)$ is the same as that in (8); and $q_\phi(z_i, z_j|x_i, x_j)$ is also defined to be Gaussian, with its mean defined as $[\mu_i; \mu_j]$ and covariance matrix defined as

$$\begin{bmatrix} \mathrm{diag}(\sigma_i \odot \sigma_i) & \mathrm{diag}(\gamma_{ij} \odot \sigma_i \odot \sigma_j)\\ \mathrm{diag}(\gamma_{ij} \odot \sigma_i \odot \sigma_j) & \mathrm{diag}(\sigma_j \odot \sigma_j) \end{bmatrix},$$

in which $\gamma_{ij} \in \mathbb{R}^d$ controls the correlation strength between $z_i$ and $z_j$, with its elements restricted to $(-1, 1)$, and $\odot$ denotes the Hadamard product. By taking the correlated posterior $q_T(Z|X)$ into the ELBO, we obtain

$$\mathcal{L}_T = \sum_{i \in V} \mathbb{E}_{q_i}\big[\log p_\theta(x_i|z_i)\big] - \sum_{(i,j)\in E_T} KL\big(q_{ij}\,\|\,p_{ij}\big) + \sum_{i \in V} (d_i - 1)\, KL\big(q_i\,\|\,p_i\big),$$

where we briefly denote the variational distributions $q_i \triangleq q_\phi(z_i|x_i)$ and $q_{ij} \triangleq q_\phi(z_i, z_j|x_i, x_j)$, the prior marginals $p_i \triangleq p_G(z_i)$ and $p_{ij} \triangleq p_G(z_i, z_j)$, and $d_i$ denotes the degree of node $i$ in the tree. Since all of these are Gaussian distributions, the KL-divergence terms above can be derived in closed form. Moreover, it can be seen that $\mathcal{L}_T$ involves only singleton or pairwise variables, so optimizing it is as efficient as for models that do not consider document correlations. With the trained model, hash codes can be obtained by binarizing the posterior means $\mu_i$ with a threshold, as done in (Chaidaroon and Fang, 2017). However, without any constraint, the mean ranges over $(-\infty, +\infty)$, so binarizing it directly loses much of the information in the original representations. To alleviate this problem, in our implementation we parameterize the posterior mean $\mu_i$ by a function of the form $\mu_i = \mathrm{sigmoid}(nn(x_i)/\tau)$, where the outermost sigmoid function forces the mean toward binary values and thus effectively reduces the quantization loss, with $nn(\cdot)$ denoting a neural network and $\tau$ controlling the slope of the sigmoid function.
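The pairwise covariance above is guaranteed to be valid whenever the per-dimension correlations lie strictly in (-1, 1). The sketch below (our notation, not the authors' code) builds this block matrix and verifies positive definiteness numerically.

```python
import numpy as np

def pair_covariance(sig_i, sig_j, gamma):
    """Covariance of the pairwise posterior q(z_i, z_j | x_i, x_j):
    diagonal variance blocks coupled by per-dimension correlations gamma."""
    off = np.diag(gamma * sig_i * sig_j)         # Hadamard-product coupling
    top = np.hstack([np.diag(sig_i**2), off])
    bot = np.hstack([off, np.diag(sig_j**2)])
    return np.vstack([top, bot])

rng = np.random.default_rng(0)
d = 4
sig_i = rng.uniform(0.5, 1.5, d)
sig_j = rng.uniform(0.5, 1.5, d)
gamma = np.tanh(rng.normal(size=d))   # tanh keeps correlations inside (-1, 1)
C = pair_covariance(sig_i, sig_j, gamma)
print(np.linalg.eigvalsh(C).min() > 0)   # positive definite -> valid Gaussian
```

Validity follows dimension-by-dimension: each latent dimension contributes an independent 2x2 block with determinant sigma_i^2 * sigma_j^2 * (1 - gamma^2) > 0, which is why restricting gamma to (-1, 1) (e.g., via tanh) suffices.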

Extending to Multiple Spanning Trees
Obviously, approximating the graph with a single spanning tree may lose too much information. To alleviate this issue, we propose to capture the similarity information with a mixture of multiple distributions, each built on a spanning tree. Specifically, we first construct a set of $M$ spanning trees $\{T_m\}_{m=1}^M$. Based on this set, a mixture-distribution prior and posterior can be constructed as

$$p_{MT}(Z) = \frac{1}{M}\sum_{m=1}^{M} p_{T_m}(Z), \qquad q_{MT}(Z|X) = \frac{1}{M}\sum_{m=1}^{M} q_{T_m}(Z|X),$$

where $p_{T_m}(Z)$ and $q_{T_m}(Z|X)$ are the prior and posterior defined on the tree $T_m$, as done in (11) and (13). By substituting the mixture distributions above into the ELBO $\mathcal{L}$ to replace the prior and posterior, we obtain a new ELBO, denoted $\mathcal{L}_{MT}$. Obviously, it is impossible to obtain a closed-form expression for the bound $\mathcal{L}_{MT}$. But as proved in (Tang et al., 2019), by using the log-sum inequality, $\mathcal{L}_{MT}$ can be further lower bounded by

$$\mathcal{L}_{MT} \ \ge\ \frac{1}{M}\sum_{m=1}^{M} \mathcal{L}_{T_m}.$$

Given the expression of $\mathcal{L}_T$, this lower bound of $\mathcal{L}_{MT}$ can also be expressed in closed form and optimized efficiently. For detailed derivations and concrete expressions, please refer to the Supplementary.
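The log-sum inequality underlying this bound is the joint convexity of the KL divergence: the KL between two equally weighted mixtures is at most the average of the component-wise KLs. The snippet below checks this on small discrete distributions (a generic sanity check, not the paper's Gaussian case).

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions (all entries > 0)."""
    return float(np.sum(p * np.log(p / q)))

# Log-sum inequality / joint convexity of KL:
# KL( (1/M) sum_m q_m || (1/M) sum_m p_m ) <= (1/M) sum_m KL(q_m || p_m).
# This is what turns the intractable mixture ELBO into an average of
# per-tree bounds.
rng = np.random.default_rng(0)
qs = rng.dirichlet(np.ones(5), size=3)   # M = 3 "posterior" components
ps = rng.dirichlet(np.ones(5), size=3)   # M = 3 "prior" components
lhs = kl(qs.mean(axis=0), ps.mean(axis=0))
rhs = float(np.mean([kl(q, p) for q, p in zip(qs, ps)]))
print(lhs <= rhs)
```

Because the right-hand side is an average of per-component (here, per-tree) terms, maximizing it decomposes over trees and inherits the singleton/pairwise structure of each L_T.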

Details of Modeling
The parameters $\mu_i$, $\mu_j$, $\sigma_i$, $\sigma_j$ and $\gamma_{ij}$ in the approximate posterior distributions $q_\phi(z_i|x_i)$ of (8) and $q_\phi(z_i, z_j|x_i, x_j)$ of (13) are all defined as the outputs of neural networks, with the parameters denoted collectively as $\phi$. Specifically, the entire model is composed of three main components: i) the variational encoder $q_\phi(z_i|x_i)$, which takes a single document as input and outputs the mean and variance of the Gaussian distribution $q_\phi(z_i|x_i)$; ii) the correlated encoder, which takes a pair of documents as input and outputs the correlation coefficients $\gamma_{ij}$; iii) the generative decoder $p_\theta(x_i|z_i)$, which takes the latent variable $z_i$ as input and outputs the document $x_i$. The decoder is modeled by a neural network parameterized by $\theta$.
The model is trained by optimizing the lower bound $\mathcal{L}_{MT}$ w.r.t. $\phi$ and $\theta$. After training, hash codes are obtained by passing the documents through the variational encoder and binarizing the outputs on every dimension with a threshold value, which is simply set to 0.5 in our experiments.
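The code-extraction step described above can be sketched as follows: a steep sigmoid (small tau) pushes the posterior means toward {0, 1} before thresholding at 0.5, which is what keeps the quantization loss small. The `logits` stand in for the output of the encoder network `nn(x)`, which is not specified here.

```python
import numpy as np

def to_hash_codes(logits, tau=0.1):
    """Squash encoder outputs with sigmoid(logits / tau), then threshold
    at 0.5 to obtain binary codes. Small tau -> means cluster near {0, 1}."""
    mu = 1.0 / (1.0 + np.exp(-logits / tau))   # steep sigmoid
    return (mu > 0.5).astype(np.uint8)

logits = np.array([[-2.3, 0.4, 1.7, -0.1]])    # hypothetical encoder outputs
print(to_hash_codes(logits))                    # -> [[0 1 1 0]]
```

Note that thresholding sigmoid outputs at 0.5 is equivalent to thresholding the raw logits at 0; the sigmoid matters during training, where it shapes the means the threshold will later see.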
To intuitively understand the insight behind our model, an illustration is shown in Figure 1. We see that if two documents are neighbors and semantically similar, their representations will be strongly correlated with each other. If they are neighbors but not semantically similar, the representations become less correlated. If they are neither neighbors nor semantically similar, the representations are not correlated at all. Since our model simultaneously preserves semantics and neighborhood information, we name it Semantics-Neighborhood Unified Hashing (SNUH).
Related Work

Deep generative models (Rezende et al., 2014) have attracted much attention in semantics-based hashing, due to their success in unsupervised representation learning. VDSH (Chaidaroon and Fang, 2017) first employed the variational autoencoder (VAE) (Kingma and Welling, 2013) to learn continuous representations of documents and then cast them into binary codes. However, because of the information loss incurred in the binarization step, such a two-stage strategy is prone to local optima and undermines performance. NASH (Shen et al., 2018) tackled this issue by replacing the Gaussian prior with a Bernoulli prior and adopting the straight-through technique (Bengio et al., 2013) to achieve end-to-end training. To further improve the model's capability, Dong et al. (2019) proposed to employ a mixture distribution as prior knowledge, and Zheng et al. (2020) exploited a Boltzmann posterior to introduce correlations among bits. Beyond generative frameworks, AMMI (Stratos and Wiseman, 2020) achieved superior performance by maximizing the mutual information between codes and documents. Nevertheless, the aforementioned semantic hashing methods all rest on the i.i.d. assumption, which means they ignore the neighborhood information.
Spectral hashing (Weiss et al., 2009) and self-taught hashing (Zhang et al., 2010) are two typical neighbor-based hashing models. But these algorithms generally ignore the rich semantic information associated with documents. Recently, some VAE-based models have tried to take semantic and neighborhood information into account concurrently, such as NbrReg (Chaidaroon et al., 2018), RBSH (Hansen et al., 2019) and PairRec (Hansen et al., 2020). However, as mentioned before, all of them simply treat the proximity information as regularization, lacking theoretical principles to guide the incorporation process. Thanks to the graph-induced distribution, we effectively preserve the two types of information within a theoretically grounded framework.

Experiment Setup
Datasets We verify the proposed method on three public datasets published with VDSH: i) Reuters21578, which contains 10,788 news documents from 90 different categories; ii) TMC, which is a collection of 21,519 air traffic reports from 22 different categories; iii) 20Newsgroups (NG20), which consists of 18,828 news posts from 20 different topics. Note that the category labels of each dataset are only used to compute the evaluation metrics, as we focus on the unsupervised scenario. Training Details For fair comparisons, we follow the same network architecture used in VDSH, GMSH and CorrSH, using a one-layer feedforward neural network as the variational and correlated encoders. The graph G is constructed with the K-nearest-neighbors (KNN) algorithm based on cosine similarity over the TFIDF features of documents. In our experiments, the correlation strength coefficient λ in (12) is fixed to 0.99. According to the performance observed on the validation set, we choose the learning rate from {0.0005, 0.001, 0.003}, the batch size from {32, 64, 128}, the temperature τ of the sigmoid function from {0.1, 0.2, · · · , 1}, and the number of trees M and the number of neighbors K both from {1, 2, . . . , 20}, with the best configuration used for evaluation on the test set. The model is trained using the Adam optimizer (Kingma and Ba, 2014). More detailed experimental settings, along with the spanning-tree generation method, are given in the supplementary materials.

Evaluation Metrics
The retrieval precision is used as our evaluation metric. For each query document, we retrieve 100 documents most similar to it based on the Hamming distance of hash codes. Then, the retrieval precision for a single sample is measured as the percentage of the retrieved documents with the same label as the query. Finally, the average precision over the whole test set is calculated as the performance of the evaluated method.
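The precision@100 metric described above can be sketched directly; the snippet below (our illustration, with random codes and labels as stand-ins) computes it for a single query.

```python
import numpy as np

def precision_at_k(query_code, query_label, db_codes, db_labels, k=100):
    """Retrieval precision for one query: the fraction of the k
    Hamming-nearest database codes sharing the query's label."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    topk = np.argsort(dists, kind="stable")[:k]     # k nearest by Hamming distance
    return float(np.mean(db_labels[topk] == query_label))

rng = np.random.default_rng(0)
db_codes = rng.integers(0, 2, size=(500, 32), dtype=np.uint8)
db_labels = rng.integers(0, 5, size=500)
p = precision_at_k(db_codes[0], db_labels[0], db_codes, db_labels, k=100)
print(0.0 <= p <= 1.0)
```

Averaging this quantity over all test queries gives the reported precision of a method; the evaluation protocol in the paper retrieves the top 100 documents per query.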

Performance and Analysis
Overall Performance The performance of all models on the three public datasets is shown in Table 1, which reports the precision at 16, 32, 64 and 128 bits on Reuters, TMC and 20Newsgroups, together with the average. We see that our model performs favorably compared to the current state-of-the-art methods, yielding the best average performance across the different datasets and settings. Compared with VDSH and NASH, which simply employ an isotropic Gaussian and a Bernoulli prior, respectively, our model, which leverages correlated prior and posterior distributions, achieves better results on all three datasets. Although GMSH improves performance by exploiting a more expressive Gaussian-mixture prior, our model still outperforms it by a substantial margin, indicating the superiority of incorporating document correlations. It is worth noting that by unifying semantics and neighborhood information under one generative model, the two types of information can be preserved more effectively. This is validated by the fact that our model performs significantly better than NbrReg, which naively incorporates the neighborhood information through a neighbor-reconstruction regularizer. The superiority of our unified method is further corroborated by the comparisons with RBSH and PairRec, which are given in the Supplementary since those methods employ a different preprocessing pipeline from the models reported here. Compared to the current SOTA methods AMMI and CorrSH, our method still achieves better results by exploiting the correlations among documents. Moreover, thanks to the correlation regularization, remarkable gains are obtained at 64 and 128 bits.

Impact of Introducing Correlations in Prior and Posterior
To understand the influence of the proposed document-correlated prior and posterior, we further experiment with two variants of our model: i) SNUH_ind, which considers document correlations in neither the prior nor the posterior distribution; ii) SNUH_prior, which considers the correlations only in the prior, but not in the posterior. The proposed SNUH thus represents the method that leverages the correlations in both the prior and the posterior. As seen from Table 2, SNUH_prior achieves better performance than SNUH_ind, demonstrating the benefit of considering the correlation information of documents in the prior alone. By further taking the correlations into account in the posterior, additional improvements of SNUH can be observed, which fully corroborates the benefit of considering document correlations in both the prior and the posterior. Another interesting observation is that the performance gap between SNUH_ind and SNUH_prior becomes smaller as the code length increases. This may be attributed to the fact that the increased capacity brought by longer codes tends to diminish the impact of the prior knowledge. However, by additionally incorporating correlation constraints on the posterior, significant performance gains are still obtained, especially in the long-code scenarios.
Effect of Spanning Trees For more efficient training, spanning trees are utilized to approximate the whole graph by dropping some of its edges. To understand their effect, we first investigate the impact of the number of trees. The first row of Figure 2 shows the performance of our method as a function of the number of spanning trees. We observe that, compared to not using any correlation, even a single tree brings significant performance gains. As the number of trees increases, the performance rises steadily at first and then converges to a certain level, demonstrating that the document correlations can be mostly captured by a few spanning trees. We then explore the impact of the number of neighbors used when constructing the graphs with the KNN method, as shown in the second row of Figure 2. It can be seen that more neighbors contribute to better performance. We hypothesize that this is partly due to the more diverse correlation information captured by the increasing number of neighbors. However, incorporating too many neighbors may introduce noise and incorrect correlation information into the hash codes, which explains why no further improvement is observed after the number reaches a certain level.

Empirical Study of Computational Efficiency
We also investigate the training complexity by comparing the training time of our method with that of VDSH on a Tesla V100-SXM2-32GB GPU, on which VDSH takes 2.038s, 4.364s and 1.051s on the three datasets, respectively. It can be seen that our model, though with much stronger performance, can be trained almost as efficiently as the vanilla VDSH, thanks to the tree approximations.

Case Study
In Table 3, we present a retrieval case for a given query document. It can be observed that as the Hamming distance increases, the topic of the retrieved documents gradually becomes less relevant, illustrating that the Hamming distance can effectively measure document relevance.

Visualization of Hash Codes
To evaluate the quality of the generated hash codes more intuitively, we project the latent representations onto a 2-dimensional plane with the t-SNE (van der Maaten and Hinton, 2008) technique. As shown in Figure 3, the representations generated by our method are more separable than those of AMMI, demonstrating the superiority of our method.

Conclusion
We have proposed an effective and efficient semantic hashing method that preserves both the semantics and the neighborhood information of documents. Specifically, we applied a graph-induced Gaussian prior to model the two types of information in a unified framework. To facilitate training, a tree-structured approximation was further developed to decompose the ELBO into terms involving only singleton or pairwise variables. Extensive evaluations demonstrated that our model significantly outperforms baseline methods by incorporating both the semantics and neighborhood information.

Appendices A Derivation of Formulas
Derivation of $KL(q_T(Z|X)\,\|\,p_T(Z))$ In the main paper, we propose a tree-structured distribution to introduce partial neighborhood information so that the $\mathcal{L}_2$ term can be expressed as a summation over terms involving only one or two variables. Here, we provide the detailed derivation. Using the tree factorizations of $p_T(Z)$ and $q_T(Z|X)$ together with their marginal-consistency relations, the KL divergence can be written as

$$KL\big(q_T(Z|X)\,\|\,p_T(Z)\big) = \sum_{(i,j)\in E_T} KL\big(q_{ij}\,\|\,p_{ij}\big) - \sum_{i\in V} (d_i - 1)\, KL\big(q_i\,\|\,p_i\big),$$

where $q_i \triangleq q_\phi(z_i|x_i)$, $q_{ij} \triangleq q_\phi(z_i, z_j|x_i, x_j)$, $p_i \triangleq p_G(z_i)$, $p_{ij} \triangleq p_G(z_i, z_j)$, and $d_i$ denotes the degree of node $i$ in the tree $T$. Obviously, the KL divergence decomposes into terms involving singleton and pairwise variables, which can be calculated efficiently.

Algorithm 1 Model Training Algorithm
Input: Document representations X; edge lists of the spanning trees E; batch size b. Output: Optimized parameters (θ, φ).

Then, we can express $\mathcal{L}_T$ in an analytical form.

Derivation of $\mathcal{L}_{MT}$ With $\mathcal{L}_{MT}$, we extend the single-tree approximation to the multi-tree approximation. Although the KL divergence between the mixture distributions does not have a closed-form solution, we can obtain an explicit upper bound on it by the log-sum inequality:

$$KL\big(q_{MT}(Z|X)\,\|\,p_{MT}(Z)\big) \le \frac{1}{M}\sum_{m=1}^{M} KL\big(q_{T_m}(Z|X)\,\|\,p_{T_m}(Z)\big),$$

which yields the lower bound $\mathcal{L}_{MT} \ge \frac{1}{M}\sum_{m=1}^{M} \mathcal{L}_{T_m}$. We can further express this bound in a more intuitive form, in which the weight of each pairwise term equals the proportion of times the edge $(i, j)$ appears among the $M$ spanning trees. To optimize this objective, we construct a minibatch estimator of the ELBO, denoted $\mathcal{L}^M_{MT}$, where $V_M$ is the subset of documents in the minibatch and $E^M_T$ is the corresponding subset of edges. We then update the parameters using the gradient $\nabla_{\phi,\theta}\, \mathcal{L}^M_{MT}$. The training procedure is summarized in Algorithm 1.

B Tree Generation Algorithm
Algorithm 2 shows the spanning-tree generation algorithm TreeGen(·) used in our graph-induced generative document hashing model. TreeGen(·) utilizes a depth-first search (DFS) algorithm to generate meaningful neighborhood information for each node. In this algorithm, RC[·] means randomly choosing one index according to the indicator function; ID[·] represents the set of node indices satisfying the indicator condition; and N(i) denotes the neighbors of node i. Due to the importance of edge precision, when choosing a neighbor (line 16 in Algorithm 2), instead of using uniform sampling, we exploit a temperature α to control the trade-off between the precision and diversity of edges when sampling neighbor j of node i. We find the best configuration of α on the validation set from the values in {0.1, 0.2, · · · , 1}.
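The exact sampling formula appears in the paper; the sketch below uses one plausible form, a softmax over affinity weights with temperature α (an assumption of ours, purely for illustration), to show the precision/diversity trade-off: small α concentrates the samples on the highest-affinity neighbor, large α flattens the distribution.

```python
import numpy as np

def sample_neighbor(weights, alpha, rng):
    """Temperature-controlled neighbor sampling. Hypothetical form: a
    softmax over affinity weights (the paper's exact formula may differ).
    Small alpha -> precision (strongest neighbor dominates);
    large alpha -> diversity (near-uniform sampling)."""
    p = np.exp(weights / alpha)
    p /= p.sum()
    return rng.choice(len(weights), p=p)

rng = np.random.default_rng(0)
w = np.array([0.9, 0.5, 0.1])   # affinities of three candidate neighbors
picks = [sample_neighbor(w, alpha=0.1, rng=rng) for _ in range(1000)]
print(np.mean(np.array(picks) == 0))   # mostly the strongest neighbor
```

With α = 0.1 the strongest neighbor is chosen the vast majority of the time; raising α toward 1 spreads the choices across all neighbors, which is the diversity end of the trade-off described above.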

C Experiment Details
For fair comparisons, we follow the experimental setting of VDSH. Specifically, the vocabulary size |V| is 7164, 20000 and 10000 for Reuters, TMC and 20Newsgroups, respectively. The training/validation/test splits are 7752/967/964 for Reuters, 21286/3498/3498 for TMC, and 11016/3667/3668 for 20Newsgroups. Moreover, the KL term in Eq. (18) of the main paper is weighted with a coefficient β to avoid posterior collapse. We find the best configuration of β on the validation set from the values in {0.01, 0.02, · · · , 0.1}. To intuitively understand our model, we illustrate the whole architecture in Table 4.

D Additional Experiments
Comparing with RBSH and PairRec As mentioned before, the reason we do not directly compare our method with RBSH (Hansen et al., 2019) and PairRec (Hansen et al., 2020) is that their data-processing method differs from that of the mainstream methods (e.g., VDSH, NASH, GMSH, NbrReg, AMMI and CorrSH). To compare with them, we evaluate our model on the three datasets published by RBSH. The results are shown in Table 5. We observe that our method achieves the best performance in most experimental settings, which further confirms the superiority of simultaneously preserving the semantics and similarity information in a more principled framework.
Parameter Sensitivity To understand the robustness of our model, we conduct a parameter sensitivity analysis of τ and β in Figure 4. Compared with β = 0 (i.e., without using the neighborhood information), models with β > 0 improve performance significantly, but the performance gradually plateaus as β grows larger, which once again confirms the importance of simultaneously modeling semantic and neighborhood information. As for the temperature coefficient τ used in the variational encoder, our model performs stably across various values of τ on the Reuters dataset. But on TMC and 20Newsgroups, increasing τ deteriorates the model's performance. Generally speaking, the model achieves better performance with a smaller τ (i.e., a steeper sigmoid function). As we use 0.5 as the threshold value, steeper sigmoid functions make the hash codes easier to distinguish.