Efficient Document Retrieval by End-to-End Refining and Quantizing BERT Embedding with Contrastive Product Quantization

Efficient document retrieval heavily relies on semantic hashing, which learns a binary code for every document and employs the Hamming distance to evaluate document distances. However, existing semantic hashing methods are mostly built on outdated TFIDF features, which discard much important semantic information about documents. Furthermore, the Hamming distance can only take one of a few integer values, significantly limiting its ability to represent document distances. To address these issues, in this paper, we propose to leverage BERT embeddings to perform efficient retrieval based on the product quantization technique, which assigns every document a real-valued codeword from a codebook, instead of a binary code as in semantic hashing. Specifically, we first transform the original BERT embeddings via a learnable mapping and feed the transformed embeddings into a probabilistic product quantization module to output the assigned codeword. The refining and quantizing modules can be optimized in an end-to-end manner by minimizing a probabilistic contrastive loss. A mutual information maximization based method is further proposed to improve the representativeness of codewords, so that documents can be quantized more accurately. Extensive experiments conducted on three benchmarks demonstrate that our proposed method significantly outperforms current state-of-the-art baselines.


Introduction
In the era of big data, Approximate Nearest Neighbor (ANN) search has attracted tremendous attention thanks to its high search efficiency and extraordinary performance in modern information retrieval systems. By quantizing each document as a compact binary code, semantic hashing (Salakhutdinov and Hinton, 2009) has become the main solution to ANN search due to the extremely low cost of calculating Hamming distances between binary codes. One of the main approaches to unsupervised semantic hashing is established on generative models (Chaidaroon and Fang, 2017; Shen et al., 2018; Dong et al., 2019; Zheng et al., 2020), which encourage the binary codes to be able to reconstruct the input documents. Alternatively, some other methods are driven by graphs (Weiss et al., 2008; Chaidaroon et al., 2020; Hansen et al., 2020; Ou et al., 2021a), hoping the binary codes can recover the neighborhood relationships. Though these methods have obtained great retrieval performance, two main problems remain.
Firstly, these methods are mostly established on top of outdated TFIDF features, which discard various kinds of important information about documents, such as word order and contextual information. In recent years, pre-trained language models like BERT have achieved tremendous success in various downstream tasks. Thus, a natural question to ask is whether we can establish efficient retrieval methods on BERT embeddings. However, it has been widely reported that BERT embeddings are not suitable for semantic similarity-related tasks (Reimers and Gurevych, 2019), performing even worse than traditional GloVe embeddings (Pennington et al., 2014). (Ethayarajh, 2019; Li et al., 2020) attribute this to the "anisotropy" phenomenon: BERT embeddings only occupy a narrow cone in the vector space, making the semantic information hidden in them hard to leverage directly. Thus, it is important to investigate how to effectively leverage BERT embeddings for efficient document retrieval.
Secondly, to guarantee retrieval efficiency, most existing methods quantize every document into a binary code via semantic hashing. There is no doubt that the Hamming distance improves retrieval efficiency significantly, but it also severely restricts the representation of document similarities, since it can only be an integer from 0 to B for B-bit codes. Recently, an alternative approach named product quantization (Jégou et al., 2011; Jang and Cho, 2021; Wang et al., 2021) has been proposed in the computer vision community to alleviate this problem. Basically, it seeks to quantize every item to one of the codewords in a codebook, which is represented by a Cartesian product of multiple small codebooks. It has been shown that product quantization is able to deliver superior performance to semantic hashing while keeping the cost of computation and storage relatively unchanged. However, this technique has rarely been explored in unsupervised document retrieval.
Motivated by the two problems above, in this paper, we propose an end-to-end contrastive product quantization model to jointly refine the original BERT embeddings and quantize the refined embeddings into codewords. Specifically, we first transform the original BERT embeddings via a learnable mapping and feed the transformed embeddings into a probabilistic product quantization module to output a quantized representation (codeword). To preserve as much semantic information as possible in the quantized representations, inspired by recent successes of contrastive learning, a probabilistic contrastive loss is designed and trained in an end-to-end manner, simultaneously optimizing the refining and quantizing modules. Furthermore, to improve the retrieval performance, inspired by recent developments in clustering, a mutual information maximization based method is developed to increase the representativeness of the learned codewords. By doing so, the cluster structure hidden in the document collection can be preserved soundly, allowing documents to be quantized more accurately. Extensive experiments are conducted on three real-world datasets, and the experimental results demonstrate that our proposed method significantly outperforms current state-of-the-art baselines. Empirical analyses also demonstrate the effectiveness of every component of our proposed model.

Preliminaries of Product Quantization for Information Retrieval
In the field of efficient information retrieval, a prevalent approach is semantic hashing, which maps every item x to a binary code b and then uses Hamming distances to reflect the semantic similarity of items. Thanks to the low cost of computing Hamming distances, retrieval can be performed very efficiently. However, the Hamming distance can only be an integer from 0 to B for B-bit codes, which is too restrictive to reflect rich similarity information. An alternative approach is vector quantization (VQ) (Gray and Neuhoff, 1998), which assigns every item a codeword from a codebook C. The codeword in VQ can be any vector in R^D, rather than being limited to the binary form as in semantic hashing. By storing pre-computed distances between any two codewords in a table, the distance between items can be obtained efficiently by looking up the table. However, to ensure competitive performance, the number of codewords in a codebook needs to be very large. For example, there are 2^64 different codes for a 64-bit binary code, and thus the number of codewords in VQ should also be of this scale, which is too large to be handled.
To tackle this issue, product quantization (Jégou et al., 2011) proposes to represent the codebook C as a Cartesian product of M small codebooks, where the m-th codebook C^m consists of K codewords {c^m_k}_{k=1}^K with c^m_k ∈ R^{D/M}. For an item, product quantization chooses a codeword from every codebook C^m, and the final codeword assigned to the item is

c = c^1 • c^2 • ⋯ • c^M,

where c^m denotes the codeword chosen from C^m, and • denotes concatenation. For each codeword c, we only need to record its indices in the M codebooks, which requires only M log2 K bits. Thanks to the Cartesian product decomposition, we only need to store MK codewords of dimension D/M and M lookup tables of size K × K. As an example, to enable a total of 2^64 codewords, we can set M = 32 and K = 4, which reduces the footprint and lookup tables significantly. During retrieval, we only need to look up the M tables and sum the results, which is only slightly more costly than computing the Hamming distance.
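To make these mechanics concrete, here is a minimal NumPy sketch of product quantization encoding and decoding. All sizes and names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Hypothetical sizes: M sub-codebooks, K codewords each, D-dim vectors.
M, K, D = 8, 16, 64

rng = np.random.default_rng(0)
codebooks = rng.normal(size=(M, K, D // M))  # C^1..C^M, each K x (D/M)

def pq_encode(x):
    """Assign each D/M-dim segment of x to its nearest codeword's index."""
    segments = x.reshape(M, D // M)
    # For each codebook m, pick argmin_k ||segment_m - c^m_k||^2
    return np.array([
        np.argmin(((codebooks[m] - segments[m]) ** 2).sum(axis=1))
        for m in range(M)
    ])

def pq_decode(indices):
    """Reconstruct the codeword c = c^1 . ... . c^M from the stored indices."""
    return np.concatenate([codebooks[m, indices[m]] for m in range(M)])

x = rng.normal(size=D)
idx = pq_encode(x)       # M integers: M * log2(K) = 8 * 4 = 32 bits total
approx = pq_decode(idx)  # D-dim quantized representation of x
```

Storing only `idx` per item is what yields the M log2 K-bit footprint described above.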
The End-to-End Joint Refining and Quantizing Framework

To retrieve semantically similar documents, a core problem is how to produce for every document a quantized representation that preserves the semantic-similarity information of the original documents. In this section, a simple two-stage method is first proposed, and then we develop an end-to-end method that is able to directly output representations with the desired properties.

A Simple Two-Stage Approach
To obtain semantics-preserving quantized representations, a naive approach is to first promote the semantic information in the original BERT embeddings and then quantize them. Many methods have been proposed for refining BERT embeddings to promote their semantic information. These methods can be essentially described as transforming the original embedding z(x) into another one z̃(x) via a mapping g(•) as

z̃(x) = g(z(x)),

where g(•) could be a flow-based mapping (Li et al., 2020), or a neural network trained to maximize the agreement between representations of a document's two views (Gao et al., 2021), etc. It has been reported that the refined embeddings z̃(x) are semantically much richer than the original ones z(x).
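As an illustration of the quantization stage of this two-stage approach, each sub-codebook could be learned with plain k-means on the corresponding segment of the already-refined embeddings. The sketch below is a hypothetical minimal version under that assumption, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M, K = 500, 64, 8, 16
Z = rng.normal(size=(N, D))  # stand-in for refined embeddings z~(x)

def kmeans_codebook(X, K, iters=20):
    """Plain k-means on one segment's vectors to obtain K codewords."""
    C = X[rng.choice(len(X), K, replace=False)]  # random initialization
    for _ in range(iters):
        # assign each vector to its nearest current codeword
        assign = ((X[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
        for k in range(K):
            if (assign == k).any():
                C[k] = X[assign == k].mean(0)  # recompute center
    return C

# Stage 2: learn an independent codebook per D/M-dim segment
segments = Z.reshape(N, M, D // M)
codebooks = np.stack([kmeans_codebook(segments[:, m], K) for m in range(M)])
```

Because the refinement in stage 1 never sees the quantization error, this pipeline motivates the end-to-end method developed next in the paper.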
Then, we can follow the standard product quantization procedure to quantize the refined embeddings z̃(x) into discrete representations.

End-to-End Refining and Quantizing via Contrastive Product Quantization
Obviously, the separation between the refining and quantizing steps in the two-stage method could result in a significant loss of semantic information in the final quantized representation. To address this issue, an end-to-end refining and quantizing method is proposed. We first slice the original BERT embedding z(x) into M segments z^m(x) ∈ R^{D/M} for m = 1, 2, ..., M. Then, we refine z^m(x) by transforming it into a semantically richer form z̃^m(x) via a mapping g^m_θ(•), that is,

z̃^m(x) = g^m_θ(z^m(x)),    (4)

where the subscript θ denotes the learnable parameters. Different from the mapping g(•), which is determined at the refining stage and is irrelevant to the quantization in the two-stage method, the mapping g^m_θ(•) here will be learned later by taking the influence of the quantization error into account. Now, instead of quantizing the refined embedding z̃^m(x) to a fixed codeword, we propose to quantize it to one of the codewords {c^m_{k_m}}_{k_m=1}^K by stochastically selecting k_m according to the distribution

p(k_m|x) = exp(−||z̃^m(x) − c^m_{k_m}||²) / Σ_{k=1}^K exp(−||z̃^m(x) − c^m_k||²).    (5)

Obviously, the probability that z̃^m(x) is quantized to a codeword is inversely proportional to their distance. Thus, by denoting k_m as a random sample drawn from p(k_m|x), i.e., k_m ∼ p(k_m|x), we can represent the m-th quantized representation of document x as

h^m(x) = c^m_{k_m}, with k_m ∼ p(k_m|x),    (6)

and the whole quantized representation of x as

h(x) = h^1(x) • h^2(x) • ⋯ • h^M(x).    (7)

Note that since the quantized representation h(x) depends on random variables, it is itself random. Now, we seek to preserve as much semantic information as possible in the quantized representation h(x). Inspired by the recent successes of contrastive learning in semantic-rich representation learning (Gao et al., 2021), we propose to minimize a contrastive loss. Specifically, for every document x, we first obtain two BERT embeddings by passing it through BERT twice with two independent dropout masks, and then use the embeddings to generate two quantized representations h^(1)(x) and h^(2)(x) according to (6) and (7). Then, we define the contrastive loss as

L_cl = (1/(2|B|)) Σ_{x∈B} [ℓ^(1)(x) + ℓ^(2)(x)],    (8)

where B denotes a mini-batch of training documents; and ℓ^(i)(x) for i = 1, 2 is defined as

ℓ^(1)(x) = −log [ exp(S(h^(1)_x, h^(2)_x)/τ_cl) / Σ_{x′∈B} exp(S(h^(1)_x, h^(2)_{x′})/τ_cl) ],    (9)

with h^(1)_x denoting the abbreviation of h^(1)(x) for conciseness; and S(h_1, h_2) = h_1ᵀh_2 / (||h_1|| ||h_2||) being the cosine similarity function, with τ_cl a temperature hyperparameter.
Under the proposed quantization method above, the quantized representation h(x) depends on the random variables k_m ∼ p(k_m|x), making it not deterministic w.r.t. a given document x. Thus, we do not directly optimize the random contrastive loss L_cl, but minimize its expectation

L̄_cl = (1/(2|B|)) Σ_{x∈B} [ℓ̄^(1)(x) + ℓ̄^(2)(x)],    (11)

where

ℓ̄^(i)(x) = E_{k_m ∼ p(k_m|x)}[ℓ^(i)(x)].    (12)

Obviously, it is impossible to derive an analytical expression for ℓ̄^(i)(x), making the optimization of L̄_cl infeasible. To address this issue, the Gumbel-Softmax reparameterization proposed in (Jang et al., 2017) can be used: a sample k_m ∼ p(k_m|x) can be equivalently drawn as

k_m = arg max_k [log p(k_m = k|x) + ξ_k],    (13)

where ξ_k denote i.i.d. random samples drawn from the Gumbel distribution Gumbel(0, 1). Then, using the softmax function to approximate the arg max(•), the m-th quantized representation h^m(x) can be approximately represented as

h̃^m(x) = Σ_{k=1}^K v_k c^m_k,    (14)

where v ∈ R^K is a probability vector whose k-th element is

v_k = exp((log p(k_m = k|x) + ξ_k)/τ) / Σ_{k′=1}^K exp((log p(k_m = k′|x) + ξ_{k′})/τ),    (15)

with τ being a temperature hyper-parameter. It can be easily seen that h̃^m(x) converges to h^m(x) as τ → 0, thus h̃^m(x) can be used as a good approximation to h^m(x). With this approximation, ℓ̄^(i)(x) in (12) can be approximately written as

ℓ̃^(1)(x) = −log [ exp(S(h̃^(1)_x, h̃^(2)_x)/τ_cl) / Σ_{x′∈B} exp(S(h̃^(1)_x, h̃^(2)_{x′})/τ_cl) ],    (16)

where h̃^(1)_x is the abbreviation of h̃^(1)(x). Substituting (16) into (11) gives an approximate analytical expression of L̄_cl that is differentiable w.r.t. the parameters θ for refining and the codebooks {C^m}_{m=1}^M for quantization. Therefore, we can optimize θ and the codebooks {C^m}_{m=1}^M in an end-to-end manner, explicitly encouraging the quantized representations h(x) to preserve more semantic information.
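The stochastic quantization of one segment with the Gumbel-Softmax relaxation can be sketched in NumPy as follows. Sizes and variable names are illustrative; an actual implementation would run in an autodiff framework so gradients flow through the soft assignment:

```python
import numpy as np

rng = np.random.default_rng(0)
K, Dm = 16, 24                  # codewords per codebook, segment dimension
C = rng.normal(size=(K, Dm))    # one sub-codebook C^m
z = rng.normal(size=Dm)         # refined segment, stand-in for z~^m(x)
tau = 5.0                       # Gumbel-Softmax temperature

# log p(k|x): softmax over negative squared distances to the codewords
neg_sq_dist = -((C - z) ** 2).sum(axis=1)
log_p = neg_sq_dist - np.logaddexp.reduce(neg_sq_dist)  # log-probabilities

# Perturb log-probs with Gumbel(0,1) noise, then soften the argmax with tau
gumbel = -np.log(-np.log(rng.uniform(size=K)))
y = (log_p + gumbel) / tau
v = np.exp(y - np.logaddexp.reduce(y))  # probability vector, sums to 1

h_soft = v @ C                           # differentiable surrogate h~^m(x)
h_hard = C[np.argmax(log_p + gumbel)]    # exact stochastic sample h^m(x)
```

As tau shrinks, `v` approaches a one-hot vector and `h_soft` approaches `h_hard`, mirroring the limit argument above.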
It is worth noting that the injected Gumbel noise ξ_k in (15) is important for yielding a sound approximation ℓ̃^(i)(x) in (16). Theoretically, the approximation ℓ̃^(i)(x) is guaranteed to converge to the exact value when τ → 0 and a large number of independent Gumbel noises are used to approximate the expectation. However, if we abandon this noise in the computation of v_k, we lose the appealing property above. Our experimental results also demonstrate the advantage of injecting Gumbel noise in the approximation. Another point worth noting is that if the refined embedding z̃^m(x) is quantized to the closest codeword deterministically, that is, letting

k_m = arg min_k ||z̃^m(x) − c^m_k||²,

then it becomes equivalent to our probabilistic quantization approach without Gumbel noise, further demonstrating the advantage of our method.

Improving the Representativeness of Codewords via MI Maximization
It can be seen that the codewords in C^m work similarly to the cluster centers in clustering. Cluster centers are known to be prone to getting stuck at suboptimal points, and the same applies analogously to the codewords. If the codewords are not representative enough, they can cause a significant loss of semantic information in the quantized representations (Ge et al., 2013; Cao et al., 2016). It has recently been observed that maximizing the mutual information (MI) between the data and the cluster assigned to it can often lead to much better clustering performance (Hu et al., 2017; Ji et al., 2019; Do et al., 2021). Inspired by this, to increase the representativeness of codewords, we also propose to maximize the MI between the original document x and the codeword (index) assigned to it. To this end, given the conditional distribution p(k_m|x) in (5), we first estimate the marginal distribution of the codeword index k_m as

p(k_m) = (1/|D|) Σ_{x∈D} p(k_m|x),    (17)

where D denotes the training dataset. Then, by definition, the entropy of the codeword index k_m can be estimated as

H(K_m) = −Σ_{k_m=1}^K p(k_m) log p(k_m).    (18)

Similarly, the conditional entropy of the codeword k_m given data x can be estimated as

H(K_m|X) = −(1/|D|) Σ_{x∈D} Σ_{k_m=1}^K p(k_m|x) log p(k_m|x).    (19)

Now, the mutual information between the codeword index k_m and data x can be obtained by definition as

I(X, K_m) = H(K_m) − H(K_m|X).

In practice, we find that it is better not to maximize the MI I(X, K_m) directly, but to maximize its variant form

I_α(X, K_m) = H(K_m) − αH(K_m|X),    (20)

where α is a non-negative hyper-parameter controlling the trade-off between the two entropy terms.
Intuitively, maximizing the MI can be understood as encouraging that only one codeword is assigned a high probability for each document x, while all codewords are used evenly overall. Given the mutual information, the final training objective becomes

L = L̄_cl − λ Σ_{m=1}^M I_α(X, K_m),    (21)

where λ is a hyper-parameter controlling the relative importance of the MI term. Since this method employs MI to improve the quality of codewords, we name the model MICPQ, i.e., Mutual-Information-Improved Contrastive Product Quantization.
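The entropy and MI estimates above can be computed directly from a batch of conditional distributions p(k_m|x). A toy NumPy sketch with made-up data (not the paper's code) follows:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 256, 16
# p(k_m|x) for each document in a toy batch; rows sum to 1
logits = rng.normal(size=(N, K))
p_cond = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

eps = 1e-12  # numerical guard inside the logs
p_marg = p_cond.mean(axis=0)                          # p(k_m), eq. (17)
H_K = -(p_marg * np.log(p_marg + eps)).sum()          # H(K_m), eq. (18)
H_K_given_X = -(p_cond * np.log(p_cond + eps)).sum(axis=1).mean()  # H(K_m|X)

mi = H_K - H_K_given_X            # I(X, K_m)
alpha = 0.1
mi_variant = H_K - alpha * H_K_given_X  # the variant actually maximized
```

Maximizing the first term pushes codeword usage toward uniform across the batch, while minimizing the second pushes each document's assignment toward one-hot, matching the intuition stated above.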

Related Work
As the main solution to efficient document retrieval, unsupervised semantic hashing has been studied for years. Many existing unsupervised hashing methods are established on generative models, encouraging the binary codes to reconstruct the original documents. For example, VDSH (Chaidaroon and Fang, 2017) proposes a two-stage scheme: it first learns continuous representations under the VAE (Kingma and Welling, 2014) framework, and then casts them into binary codes. To tackle the two-stage training issue, NASH (Shen et al., 2018) presents an end-to-end generative hashing framework in which the binary codes are treated as Bernoulli latent variables, and introduces the Straight-Through estimator (Bengio et al., 2013) to estimate the gradient w.r.t. the discrete variables. Dong et al. (2019) employs mixture priors to empower the binary codes with stronger expressiveness, resulting in better performance. Further, CorrSH (Zheng et al., 2020) employs the distribution of the Boltzmann machine to introduce correlations among the bits of binary codes. Ye et al. (2020) proposes auxiliary implicit topic vectors to address the issue of information loss in the few-bit scenario. Also, a handful of recent works focus on how to inject the neighborhood information of a graph under the VAE framework. Chaidaroon et al. (2018) and Hansen et al. (2020) seek to learn binary codes that can reconstruct the neighbors of the original documents. A ranking loss is introduced in (Hansen et al., 2019) to accurately characterize the correlation between documents. Ou et al. (2021a) first propose to integrate the semantics and neighborhood information with a graph-driven generative model.
Beyond generative hashing methods, studies on hashing via the mutual information (MI) principle have emerged recently. AMMI (Stratos and Wiseman, 2020) learns high-quality binary codes by maximizing the MI between documents and binary codes. DHIM (Ou et al., 2021b) compresses BERT embeddings into binary codes by maximizing the MI between the global codes and local codes of documents.
Another efficient retrieval mechanism is product quantization. Product quantization (PQ) (Jégou et al., 2011) and its improved variants, such as Optimized PQ (Ge et al., 2013) and Locally Optimized PQ (Kalantidis and Avrithis, 2014), were proposed to retain richer distance information than hashing methods while conducting retrieval efficiently. These shallow unsupervised PQ methods are often based on well-trained representations and learn the quantization module with heuristic algorithms (Xiao et al., 2021), which often cannot achieve satisfactory performance. In this paper, we propose an MI-improved end-to-end unsupervised product quantization model, MICPQ. We notice that the proposed MICPQ is somewhat similar to recent PQ works (Jang and Cho, 2021; Wang et al., 2021) in the computer vision community. However, (Jang and Cho, 2021) focuses on analyzing the performance difference between different forms of contrastive losses, while we focus on how to design an end-to-end model to jointly refine and quantize BERT embeddings via contrastive product quantization from a probabilistic perspective. (Wang et al., 2021) concentrates on improving codeword diversity to prevent model degeneration through regularizer design; in contrast, no "model degeneration" phenomenon is observed in our model, and we use a mutual information maximization based method to increase the representativeness of codewords and further improve the retrieval performance.

Datasets, Evaluation and Baselines
Datasets The proposed MICPQ model is evaluated on three benchmark datasets, including NYT (Tao et al., 2018), AGNews (Zhang et al., 2015) and DBpedia (Lehmann et al., 2015).Details of the three datasets can be found in Appendix A.
Evaluation Metrics For every query document from the testing set, we retrieve its top-100 most similar documents from the training set with the Asymmetric distance computation (Jégou et al., 2011), which is formalized in Appendix B. The retrieval precision is then calculated as the ratio of relevant documents among those retrieved. Note that a retrieved document is considered relevant to a query if they share the same label. Finally, the retrieval precision averaged over all test documents is reported.
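This precision metric can be sketched as follows, assuming labels are available for the queries and the retrieved documents (the function name and toy data are ours):

```python
import numpy as np

def retrieval_precision(query_labels, retrieved_labels):
    """Mean precision@k: fraction of retrieved docs sharing the query's label.

    query_labels:     (Q,)   label of each query document
    retrieved_labels: (Q, k) labels of the top-k retrieved docs per query
    """
    relevant = retrieved_labels == query_labels[:, None]
    return relevant.mean(axis=1).mean()

# Toy check with 2 queries and top-4 retrieval
q = np.array([0, 1])
r = np.array([[0, 0, 1, 0],   # 3/4 relevant
              [1, 0, 1, 1]])  # 3/4 relevant
print(retrieval_precision(q, r))  # 0.75
```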
AEPQ. We are interested in optimizing our product quantizer with a reconstruction objective. Specifically, the quantized semantics-preserving representation h(x) is expected to reconstruct the original input features (i.e., BERT embeddings) through a newly added decoder, similar to previous generative hashing methods. As with MICPQ, the performance of AEPQ is evaluated with the Asymmetric distance computation. We name this baseline AEPQ, i.e., Auto-Encoder-based Product Quantization.
CSH. Following the encoder setting of NASH (Shen et al., 2018), we assume that the binary codes are generated by sampling from a multivariate Bernoulli distribution and propose to directly minimize the expected contrastive loss w.r.t. the discrete binary codes. The straight-through (ST) gradient estimator (Bengio et al., 2013) is utilized for end-to-end training. The data augmentation strategy is the same as that of our MICPQ. We name this baseline CSH, i.e., Contrastive Semantic Hashing.
Training Details In our MICPQ, to output refined vectors of the desired dimension for the product quantizer, the encoder network consists of a pre-trained BERT module followed by a one-layer ReLU feedforward neural network on top of the 768-dimensional [CLS] representation. For performance under the average pooling setting of BERT, please refer to Appendix C. During training, following the setting in DHIM (Ou et al., 2021b), we fix the parameters of BERT and only train the newly added parameters. We implement the proposed model with PyTorch and employ the Adam optimizer (Kingma and Ba, 2015).
In terms of hyper-parameters relevant to product quantization, the codeword dimension D/M in each small codebook C^m is fixed to 24, and the number of codewords K in each codebook C^m is fixed to 16. By setting the number of small codebooks M to one of {4, 8, 16, 32}, the final codeword in the codebook C can be represented by {16, 32, 64, 128} bits according to B = M log2 K. Thus, comparisons with semantic hashing methods are conducted under the same number of bits.
For other hyper-parameters, the learning rate is set to 0.001; the dropout rate p_drop used to generate positive pairs for contrastive learning is set to 0.3; the Gumbel-Softmax temperature τ is set to 10 for 16-bit codes and 5 for longer codes; the temperature τ_cl in contrastive learning is set to 0.3; the trade-off coefficient α in (20) is set to 0.1; and the coefficient λ in (21) is chosen from {0.1, 0.2, 0.3} according to the performance on the validation set.

Overall Performance
Table 1 presents the performance of our proposed model and existing baselines on three public datasets with code lengths varying from 16 to 128. It can be seen that the proposed simple baseline CSH achieves promising performance across all three datasets and nearly all code-length settings when compared to previous state-of-the-art methods, demonstrating the superiority of using contrastive learning to promote semantic information. Further, our proposed MICPQ outperforms the previous methods by a more substantial margin. Specifically, improvements of 4.32%, 3.07% and 4.37%, averaged over all code lengths, are observed on the NYT, AGNews and DBpedia datasets, respectively, when compared with the current state-of-the-art DHIM. Moreover, the performance of AEPQ lags remarkably behind our proposed MICPQ, which illustrates the limitation of the reconstruction objective. It is also observed that the retrieval performance of our proposed MICPQ consistently improves as the code length increases. Although this is consistent with the intuition that a longer code can preserve more semantic information, it does not always hold for some previous models (e.g., NASH, BMSH, DHIM).

Ablation Study
To understand the influence of the different components of MICPQ, we further evaluate the retrieval performance of two variants of MICPQ.
(i) MICPQ_cl: it removes the mutual-information term I(X, K_m) in each codebook and only optimizes the quantized contrastive loss to learn the codewords. (ii) MICPQ_softmax: it replaces the probabilistic quantization with a deterministic softmax relaxation, i.e., no Gumbel noise is injected when computing the probability vector v. As shown in Table 2, when compared to MICPQ_cl, our MICPQ improves the retrieval performance averaged over all code lengths by 1.51% and 0.94% on NYT and AGNews respectively, demonstrating the effectiveness of the mutual-information term inside each codebook. Also, by comparing MICPQ to MICPQ_softmax, consistent improvements can be observed on both datasets, which demonstrates the superiority of the proposed probabilistic product quantization.

Link to Semantic Hashing: A Special Form of MICPQ
Through an extreme setting, the MICPQ model can be recast in the semantic hashing framework and evaluated using the Hamming distance rather than the Asymmetric distance. Specifically, we push the number of codebooks M to its maximal limit (i.e., the code length B), and the number of codewords K inside each codebook is forced to 2 to satisfy the equation B = M log2 K. This way, the state of each bit in the B-bit binary code is decided by a sub-space with 2 codewords. Under this extreme setting, we can evaluate MICPQ with either the Hamming distance or the Asymmetric distance. We name the two models MICPQ-EH (i.e., Extreme-Hamming) and MICPQ-EA (i.e., Extreme-Asymmetric). As shown in Figure 1, MICPQ-EA consistently outperforms MICPQ-EH thanks to the superiority of Asymmetric distance computation. Also, MICPQ-EH can be seen as a semantic hashing model since it is evaluated with the Hamming distance, and its better retrieval performance compared to CSH shows that this special form of MICPQ is itself an excellent semantic hashing method. To sum up, the proposed MICPQ is more flexible and powerful than the existing hashing baselines for efficient document retrieval.

Evaluating the Quality of Codewords from the Clustering Perspective
To examine how well the codewords can represent the corresponding refined vectors z̃^m(x), we set the number of codewords K in each codebook to the number of ground-truth classes of each dataset (i.e., K = 26, 14 and 4 on NYT, DBpedia and AGNews respectively), so that we can compute the unsupervised clustering accuracy in each codebook with the help of the Hungarian algorithm (Kuhn, 1955). The number of codebooks M is set to 8 on all datasets. For comparison, we also run K-Means on each z̃^m(x) separately. As shown in Table 3, the codewords in our MICPQ are on par with the ones learned by K-Means. Notably, our MICPQ significantly outperforms K-Means on NYT, which has the largest number of classes (i.e., 26), demonstrating the ability of our MICPQ to learn high-quality codewords, especially on datasets with diverse categories.
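A hedged sketch of this label-matching accuracy: the paper uses the Hungarian algorithm, but for a toy number of classes the optimal matching can equivalently be found by brute force over label permutations, which is what the illustrative function below does:

```python
import itertools
import numpy as np

def clustering_accuracy(y_true, y_pred, n_classes):
    """Best-match accuracy between predicted cluster indices and labels.

    Brute-forces the label permutation (fine for a toy n_classes; the
    paper's setting would use the Hungarian algorithm instead).
    """
    best = 0.0
    for perm in itertools.permutations(range(n_classes)):
        mapped = np.array([perm[k] for k in y_pred])
        best = max(best, (mapped == y_true).mean())
    return best

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 1])   # same clustering, relabeled
print(clustering_accuracy(y_true, y_pred, 3))  # 1.0
```

In practice, `scipy.optimize.linear_sum_assignment` on the confusion matrix computes the same optimal matching in polynomial time.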

Sensitivity Analyses of Hyper-Parameters
We investigate the influence of four key hyper-parameters: the coefficient λ, the dropout rate p_drop, the temperature τ_cl in contrastive learning, and the Gumbel-Softmax temperature τ. As shown in Figure 2, compared with the case of λ = 0, obvious performance gains can be observed by introducing the mutual-information loss inside each codebook when λ is set to a relatively small value (e.g., 0.2 or 0.3). The value of p_drop controls the strength of data augmentation. We see that once p_drop exceeds some threshold (e.g., 0.4), the performance decreases sharply, and eventually the model collapses on both datasets. Also, as τ_cl grows, the precision first increases and reaches its peak when τ_cl is around 0.3 on both datasets. As for the Gumbel-Softmax temperature τ, we suggest setting it to a larger value, say within [5, 15].

Conclusion
In this paper, we proposed a novel unsupervised product quantization model, namely MICPQ. In MICPQ, we developed an end-to-end probabilistic contrastive product quantization model to jointly refine the original BERT embeddings and quantize the refined embeddings into codewords, with a probabilistic contrastive loss designed to preserve the semantic information in the quantized representations. Moreover, to improve the representativeness of the codewords and thereby soundly keep the cluster structure of documents, we proposed to maximize the mutual information between the data and the codeword assignment. Extensive experiments showed that our model significantly outperforms existing unsupervised hashing methods.

Limitations
In our work, we do not analyze the individuality of codebooks or the differences between codebooks. However, each codebook should capture different aspects of the documents in order to preserve rich semantics. Therefore, it may help to further analyze and enlarge the differences between codebooks. Also, in the future we may try to apply this method to passage retrieval tasks, for which some large-scale evaluation datasets are available.

A Datasets

Three datasets are used to evaluate the performance of the proposed model. 1) NYT (Tao et al., 2018) contains news articles published by The New York Times; 2) AGNews (Zhang et al., 2015) is a news collection gathered from academic news search engines; 3) DBpedia (Lehmann et al., 2015) contains the abstracts of articles from Wikipedia. We apply the same string cleaning operation as in (Ou et al., 2021b). The statistics of the three datasets are shown in Table 4.

B Retrieval
For testing, we first encode all the documents in the search set (i.e., the training set in our setting). For each document x_i in the search set, we encode it using the hard quantization operation to obtain the codeword indices {k^1_i, k^2_i, ..., k^M_i}. All these codeword indices are then concatenated and stored in the form of an M log2 K-bit binary code.
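This storage scheme, together with the query-time table-based distance computation, can be sketched in NumPy as follows (all sizes are illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, Dm = 8, 16, 24
codebooks = rng.normal(size=(M, K, Dm))  # C^1..C^M

# Database: each document stored only as its M codeword indices
n_docs = 1000
db_codes = rng.integers(0, K, size=(n_docs, M))

def asymmetric_distances(query_segments):
    """AD(x_q, x_i) = sum_m ||z~^m(x_q) - c^m_{k^m_i}||^2 via lookup tables."""
    # Query-specific table: distance from each segment to every codeword
    table = ((codebooks - query_segments[:, None, :]) ** 2).sum(axis=2)  # (M, K)
    # Sum the chosen entries for every database document
    return table[np.arange(M), db_codes].sum(axis=1)  # (n_docs,)

query = rng.normal(size=(M, Dm))   # stand-in for refined segments of x_q
dists = asymmetric_distances(query)
top100 = np.argsort(dists)[:100]   # top-100 retrieval by Asymmetric distance
```

Note the per-query cost: one M × K table, then M table lookups and additions per database document, which is only slightly more than a Hamming distance computation.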
In the retrieval stage, given a query document x_q, we first extract its refined embedding z̃(x_q) through the encoder network and then slice it into M equal-length segments as z̃(x_q) = [z̃^1(x_q), z̃^2(x_q), ..., z̃^M(x_q)]. Then we compute the Asymmetric distance (AD) (Jégou et al., 2011) between the query x_q and a document x_i with the squared Euclidean distance metric as

AD(x_q, x_i) = Σ_{m=1}^M ||z̃^m(x_q) − h^m(x_i)||²,

where h^m(x_i) = C^m · one_hot(k^m_i) represents the quantized representation (i.e., one of the codewords) of x_i in the m-th codebook. To compute this efficiently, we first pre-compute a query-specific distance look-up table of size M × K that stores the distances between each segment z̃^m(x_q) and all codewords of the corresponding codebook. With the pre-computed look-up table, AD(x_q, x_i) can be computed by simply summing up the chosen values from the table, which is only slightly more costly than computing the Hamming distance.

C [CLS] vs. Average Pooling

We are interested in whether taking the average embeddings of the last layers of BERT as inputs leads to better performance than the [CLS] representation. The experiments are conducted on our proposed model MICPQ and the simple baseline CSH. Table 5 shows that the two settings each achieve better performance on different datasets. For simplicity, we consistently use the [CLS] representation in our experiments.

Table 1 :
The precision (%) comparison with different state-of-the-art unsupervised efficient retrieval methods. Bold numbers represent the best performance, and underlined numbers represent the second-best performance.

Table 3 :
Average and maximal accuracy (%) over M codebooks on NYT.

Table 4 :
Statistics of Three Benchmark Datasets.

Table 5 :
The performance comparison between the [CLS] representations and the average embeddings of BERT.