Matching-oriented Embedding Quantization For Ad-hoc Retrieval

Product quantization (PQ) is a widely used technique for ad-hoc retrieval. Recent studies propose supervised PQ, where the embedding and quantization models can be jointly trained with supervised learning. However, there is a lack of an appropriate formulation of the joint training objective; thus, the improvements over previous non-supervised baselines are limited in reality. In this work, we propose the Matching-oriented Product Quantization (MoPQ), where a novel objective, the Multinoulli Contrastive Loss (MCL), is formulated. With the minimization of MCL, we are able to maximize the matching probability of query and ground-truth key, which contributes to the optimal retrieval accuracy. Given that the exact computation of MCL is intractable due to the demand for vast contrastive samples, we further propose the Differentiable Cross-device Sampling (DCS), which significantly augments the contrastive samples for a precise approximation of MCL. We conduct extensive experimental studies on four real-world datasets, whose results verify the effectiveness of MoPQ. The code is available at https://github.com/microsoft/MoPQ.


Introduction
Ad-hoc retrieval is critical for many intelligent services, like web search and recommender systems. Given a query (e.g., a search request from a user), ad-hoc retrieval selects relevant keys (e.g., webpages) from a massive set of candidates. It is usually treated as an approximate nearest neighbour search (ANNS) problem, where product quantization (PQ) (Jegou et al., 2010) is one of the most popular solutions thanks to its competitive memory and time efficiency. PQ is built upon "codebooks", with which an input embedding can
be quantized into a Cartesian product of "codewords" (preliminaries about the codebooks and codewords will be given in Section 2.1). By this means, the original embedding can be compressed into a highly compact representation. More importantly, it significantly improves the retrieval efficiency, as query and key's similarity can be approximated based on the pre-computed distances between query and codewords.
Existing Limitation. The original PQ algorithms (Jegou et al., 2010; Ge et al., 2013) are non-supervised: based on the well-trained embeddings, the quantization model is learned with heuristic algorithms (e.g., k-means). In recent years, many works have been dedicated to supervised PQ (Cao et al., 2016, 2017; Klein and Wolf, 2019), where the embedding model and the quantization model are trained jointly. Supervised PQ requires an explicit training objective for the quantization model. Most of the time, "the minimization of reconstruction loss" is taken for granted: the distortion between the original embedding (z) and its quantization result (z̃) needs to be reduced as much as possible for all the keys (k) in the database: min Σ_k ||z^k − z̃^k||^2. This objective is seemingly plausible, as a "small enough distortion" will make the quantized embeddings equally expressive as the original embeddings. However, it implicitly hypothesizes that the distortion can be made sufficiently small, which is not always realistic in practice. This is because the reconstruction loss is subject to a lower bound determined by the codebooks' scale. As over-scaled codebooks result in prohibitive memory and time costs, there will always exist a positive reconstruction loss in reality. In this case, it can be proved that the minimization of reconstruction loss does not lead to the optimal retrieval accuracy. It is also empirically verified that supervised PQ's advantage over the non-supervised baselines is limited and not consistently positive when reconstruction loss minimization is taken as the training objective (see Sections 2.2 and 4 for detailed analysis).
Our Work. To address the above challenge, we propose the Matching-oriented Product Quantization (MoPQ), with a novel objective MCL formulated to optimize PQ's retrieval accuracy, together with a sample augmentation strategy DCS to ensure the effective minimization of MCL.
• The Multinoulli Contrastive Loss (MCL) is formulated as the new quantization training objective. The PQ-based ad-hoc retrieval can be probabilistically modeled by a cascaded generative process: 1) select codewords for the input key, based on which the quantized key embedding is composited; 2) sample the query from the Multinoulli distribution determined by the quantized key embedding. The negative of the generation likelihood is referred to as the Multinoulli Contrastive Loss (MCL). By minimizing MCL, the expected query-key matching probability will be optimized, which means the optimal retrieval accuracy.
• The contrastive sample augmentation is designed to facilitate the minimization of MCL. The computation of MCL is intractable as it requires normalization over vast contrastive samples (the quantized embeddings of all the keys). Thus, it has to be approximated by sampling, whose effect is severely affected by sample size. In this work, we propose the Differentiable Cross-device Sampling (DCS), where a distributed embedding sharing mechanism is introduced for contrastive sample augmentation. As the gradients are stopped at the cross-device shared embeddings, we propose a technique called the "combination of Primary and Image Losses", where the shared embeddings are made virtually differentiable to keep the model update free from distortions.
In summary, our work identifies a long-existing defect in the training objective of supervised PQ; meanwhile, a simple but effective remedy is proposed, which optimizes the expected retrieval accuracy of PQ. We conduct extensive experimental studies on four benchmark text retrieval datasets, where our proposed methods significantly outperform the SOTA supervised PQ baselines. Our code and datasets will be made publicly available to facilitate the research progress in related areas.

Revisit of Supervised PQ
We start with the preliminaries of PQ's application in ad-hoc retrieval. Then, we analyze the defect of taking the reconstruction loss minimization as supervised PQ's training objective.

Preliminaries
• Product Quantization (PQ). PQ is a popular approach for learning quantized representations. It is built on the foundation of M codebooks C: {C^1, ..., C^M}; each C^i consists of L codewords: {C_i1, ..., C_iL}; each C_ij is a vector of dimension d/M. Given an embedding z (with dimension d), it is first sliced into M sub-vectors [z^1, z^2, ..., z^M]; then each sub-vector z^i is assigned to one of the codewords of codebook C^i, whose ID is denoted by the one-hot vector b^i. The assignment is made by the codeword selection function, which maps each sub-vector z^i to its most relevant codeword, e.g., b^i = one_hot(argmin_j ||z^i − C_ij||_2). As a result, the embedding z is quantized into a collection of codes B = {b^1, ..., b^M}, where the embedding itself is approximated by the concatenation of the assigned codewords: z̃ = [C^1 b^1, ..., C^M b^M]. Non-supervised PQ takes the well-trained embeddings as input and learns the quantization model with heuristic algorithms, like k-means. In contrast, supervised PQ jointly learns the embedding and quantization models based on labeled data (the paired query and key). Specifically, it learns the query and key embeddings z^q and z^k such that the query-key relationship can be predicted based on their inner product ⟨z^q, z^k⟩. Furthermore, it learns the codebooks so that the quantized embeddings may preserve the same expressiveness as the original embeddings.
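The slice-and-assign step above can be sketched in a few lines (a minimal NumPy illustration with the l2 selection function; the helper name and shapes are our own, not from the paper's implementation):

```python
import numpy as np

def quantize(z, codebooks):
    """Quantize an embedding z (dim d) with M codebooks of L codewords each.

    codebooks: array of shape (M, L, d // M). Returns the code IDs b and the
    quantized embedding z~ (concatenation of the selected codewords).
    """
    M, L, sub_dim = codebooks.shape
    subvectors = z.reshape(M, sub_dim)          # slice z into M sub-vectors
    codes, parts = [], []
    for i in range(M):
        # assign sub-vector i to its nearest codeword (l2 selection)
        dists = np.linalg.norm(codebooks[i] - subvectors[i], axis=1)
        b_i = int(np.argmin(dists))
        codes.append(b_i)
        parts.append(codebooks[i, b_i])
    return codes, np.concatenate(parts)
```

With M = 8 codebooks of L = 256 codewords (the paper's default), each key is compressed to just 8 one-byte codes.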
• PQ for ad-hoc retrieval. PQ is also widely applied to ad-hoc retrieval. On one hand, the float vectors are quantized for high memory efficiency. On the other hand, the retrieval process can be significantly accelerated. For each key within the database, the embedding z^k is quantized as z̃^k = [C^1 b^1, ..., C^M b^M]. For a given query embedding z^q, the inner products with the M codebooks can be enumerated and kept in the distance table T^q, where T^q[i, j] = ⟨z^q_i, C_ij⟩. Finally, the query-key inner product ⟨z^q, z̃^k⟩ can be efficiently derived by looking up the pre-computed results in T^q: ⟨z^q, z̃^k⟩ = Σ_{i=1..M} T^q[i, b^i], where no more dot-product computations are needed.
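The table-lookup trick can be sketched as follows (a hypothetical NumPy sketch; function names are ours, and the codebook layout matches the preliminaries above):

```python
import numpy as np

def build_distance_table(z_q, codebooks):
    """T_q[i, j] = <z_q_i, C_ij>: inner products between the query's
    sub-vectors and every codeword, computed once per query."""
    M, L, sub_dim = codebooks.shape
    q_sub = z_q.reshape(M, sub_dim)
    # 'md,mld->ml': for each codebook m, dot every codeword l with q_sub[m]
    return np.einsum('md,mld->ml', q_sub, codebooks)

def approx_inner_product(table, codes):
    """<z_q, z~_k> recovered by M table lookups; no further dot products."""
    return sum(table[i, b] for i, b in enumerate(codes))
```

Scoring a key then costs M lookups and M − 1 additions, independent of the embedding dimension d.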

Defect of Reconstruction Loss
Reconstruction loss minimization is usually adopted as the quantization model's training objective in supervised PQ. It requires the distortion between each key embedding and its quantization result to be reduced as much as possible: min Σ_k ||z^k − z̃^k||^2 (Eq. 1). The minimization of reconstruction loss is seemingly plausible given the following hypotheses.
Hypothesis 2.1. The quantized keys' embeddings are less accurate in predicting query and key's relevance, compared with the non-quantized ones.
Hypothesis 2.2. The loss of prediction accuracy is a monotonically increasing function of the reconstruction loss (defined in Eq. 1).
The first hypothesis can be "assumed correct" in reality, considering that the quantized embeddings are less expressive than the original ones (due to the finite number of codewords). However, the second hypothesis is problematic. In the following, we analyze the underlying defect from both theoretical and empirical perspectives.

Theoretical Analysis
We theoretically analyze two properties of the reconstruction loss: 1) it is indelible; and 2) decreasing the reconstruction loss does not necessarily improve the prediction accuracy.

Theorem 2.1. (Positive Reconstruction Loss)
The reconstruction loss is positive if the codebooks' scale is smaller than the key's scale.
That is to say, the input will always be changed after quantization given a reasonable scale of codebooks, making it impossible to keep the quantized embeddings equally expressive as the original embeddings by eliminating the reconstruction loss. Moreover, we further show that the reduction of the reconstruction loss doesn't necessarily improve the prediction accuracy.
We start by showing the existence of "quantization-invariant perturbations", i.e., perturbations that can be added to the codebooks without changing the codeword assignment.

Lemma 2.1. For each codebook C^i, there always exist perturbation vectors ε_i such that the manipulation of codewords Ĉ_i* ← C_i* + ε_i does not affect the codeword assignment.
On top of the existence of quantization-invariant perturbations, we may derive the "non-monotone" property of the relationship between the prediction accuracy and the reconstruction loss.
Theorem 2.2. (Non-Monotone) PQ's prediction accuracy is not monotonically increasing with the reduction of the reconstruction loss.
Proof. The statement is proved by contradiction. For each codebook, we generate a perturbation ε_i which satisfies Lemma 2.1 and add it to the codewords: Ĉ_i* ← C_i* + ε_i. According to Lemma 2.1, the codeword assignment will not change, so the quantized key embedding becomes ẑ^k = z̃^k + ε, where ε = [ε_1, ..., ε_M]. Now we may derive the following relationship between the reconstruction losses: E||z^k − ẑ^k||^2 = E||z^k − z̃^k − ε||^2 = E||z^k − z̃^k||^2 + ||ε||^2 > E||z^k − z̃^k||^2. (The first equality holds conditioned on the independence between z^k − z̃^k and ε; ε is generated from an arbitrary unit vector, so the independence condition always holds.) That is to say, the reconstruction loss is increased after the perturbation. At the same time, we may also derive the query-key relationship before (R) and after (R̂) the perturbation: R̂ = ⟨z^q, ẑ^k⟩ = ⟨z^q, z̃^k⟩ + ⟨z^q, ε⟩ = R + ⟨z^q, ε⟩, where the offset ⟨z^q, ε⟩ is shared by all keys for a given query. In other words, the relationship between query and key is preserved despite the increased reconstruction loss. Thus, a contradiction is obtained for the monotonic relationship between the prediction accuracy and the reconstruction loss.
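The contradiction can be illustrated numerically: shifting every quantized key by the same ε (mimicking a quantization-invariant perturbation of the codewords) inflates the reconstruction loss while leaving the query-to-key ranking untouched. A toy simulation on synthetic data (our own construction, not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
keys = rng.normal(size=(100, 8))                      # original embeddings z_k
quantized = keys + 0.1 * rng.normal(size=keys.shape)  # stand-in for z~_k
query = rng.normal(size=8)

# a quantization-invariant perturbation shifts every quantized key by the
# same epsilon (the concatenation of the per-codebook perturbations)
eps = 0.5 * rng.normal(size=8)
perturbed = quantized + eps

loss_before = np.mean(np.sum((keys - quantized) ** 2, axis=1))
loss_after = np.mean(np.sum((keys - perturbed) ** 2, axis=1))

# scores shift by the constant <query, eps>, so the ranking is preserved
rank_before = np.argsort(-(quantized @ query))
rank_after = np.argsort(-(perturbed @ query))
```

Here `loss_after` exceeds `loss_before`, yet `rank_before` and `rank_after` are identical, matching Theorem 2.2.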

Empirical Analysis
To further verify the theoretical conclusions, we empirically analyze the relationship between the prediction accuracy and the reconstruction loss, as shown in Table 1, taking the deep quantization network (DQN for short) as the example (detailed experiment settings are given in Section 4). The weight of the reconstruction loss is tuned over {1, 1e-1, 1e-2, 1e-3} (larger weights lead to greater loss reduction), while the weight of embedding learning is fixed to 1. We find that the reconstruction loss is monotonically reduced as a larger weight is used. However, the smallest reconstruction loss does not bring the highest prediction accuracy (measured with Recall@10), which echoes our theoretical analysis. More experimental studies in Section 4 demonstrate that supervised PQ's advantages over the non-supervised baselines are inconsistent and sometimes insignificant when reconstruction loss minimization is taken as the objective (even with optimally tuned weights).

Figure 1: PQ's retrieval workflow: (I) Quantization: the key's embedding (z^k) is assigned to codes, whose related codewords are composited as the quantized key embedding (z̃^k); (II) Matching: the quantized key embedding (one of the centroids of the Voronoi diagram) confines the targeted query embedding z^q within its own Voronoi cell.

Table 1: Relationship between the reconstruction loss (R-Loss) and PQ's accuracy (Recall@10). The reconstruction loss is monotonically reduced as its weight becomes larger. However, the smallest reconstruction loss (marked in the table) does not lead to the optimal accuracy (marked in bold).
To summarize, the reconstruction loss cannot be eliminated with a feasible scale of codebooks, and the reduction of the reconstruction loss does not necessarily improve the retrieval accuracy. As such, we turn to formulating a new objective, where the model stays with the reconstruction loss but directly optimizes PQ's retrieval accuracy.

Matching-oriented PQ
We present the Matching-oriented PQ (MoPQ) in this section. In MoPQ, a new quantization objective, MCL, is proposed, whose minimization optimizes the query-key matching probability and thereby brings about the optimal retrieval accuracy. In addition, the DCS method is introduced, which facilitates the effective minimization of MCL.

Multinoulli Contrastive Loss
The ad-hoc retrieval with PQ can be divided into two stages, as shown in Figure 1. The first stage is called Quantization. The key embedding is assigned to a series of codes, each corresponding to one codeword within a codebook. The assigned codewords are composited as the quantized key embedding, which is a centroid of the Voronoi diagram determined by the codebooks. The second stage is called Matching, where the quantized key embedding confines the targeted query embedding within its own Voronoi cell. The above matching process is probabilistically modeled as the Multinoulli Generative Process (as Figure 2):
• For each codebook i, a codeword is sampled from the Multinoulli distribution Mul(C_ij | z^k_i), denoted as z̃^k_i; the quantized key embedding z̃^k is generated as the concatenation of {z̃^k_i}_M.
• The query is sampled from the distribution Mul(z^q | z̃^k), parameterized by the quantized key embedding z̃^k and the query embedding z^q.
Thus, the matching probability can be factorized as the joint probability of making the codeword selection from all codebooks, Π_i P(C_ij | z^k_i), and sampling the query based on z^q and z̃^k, P(z^q | z̃^k):

P(z^q, z̃^k) = Σ_j [ Π_i P(C_ij | z^k_i) ] · P(z^q | z̃^k),

where "Σ_j" indicates the enumeration of all possible codeword selections. We expect to maximize the above query-key matching probability so as to achieve the optimal retrieval accuracy. However, the exact calculation is intractable due to: 1) the almost infinite combinations of codewords, and 2) the unknown distribution of the queries. Therefore, the following transformations are made. Firstly, we leverage the Straight-Through Estimator (Bengio et al., 2013), where P(C_ij | z^k_i) is transformed by the hard thresholding function:

P̂(C_ij | z^k_i) = 1 if j = argmax_j' P(C_ij' | z^k_i), and 0 otherwise.

The above probability is calculated as P̂ = (P̂ − P).sg() + P such that the gradients can be back-propagated ("sg()" is the stop-gradient operation). Now the generation probability is simplified as:

P(z^q, z̃^k) = Π_i P̂(C_ij* | z^k_i) · P(z^q | z̃^k),

where j* = argmax_j P(C_ij | z^k_i). ("Σ_j" can be removed because P̂(C_ij | z^k_i) = 0, ∀j ≠ j*.) Secondly, we make a further transformation of the query's generation probability:

P(z^q | z̃^k) = P(z̃^k | z^q) · P(z^q) / P(z̃^k),

where P(z^q) and P(z̃^k) are prior probabilities regarded as unknown constants. The conditional probability P(z̃^k | z^q) calls for the normalization over the quantized key embeddings {z̃^k}, which are deterministic and predefined. Now the generation probability is transformed as:

P(z^q, z̃^k) ∝ Π_i P̂(C_i,s(z^k_i) | z^k_i) · P(z̃^k | z^q),   (2)

where "s(·)" is the code assignment function (detailed forms will be discussed in Section 4.1). The negative logarithm of the final simplification in Eq. 2 is called the Multinoulli Contrastive Loss (MCL). It is used as our quantization training objective, whose minimization optimizes the query-key matching probability.
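A forward-pass sketch of the two ingredients of MCL (NumPy, our own simplification: l2 codeword selection and softmax normalization over a given set of contrastive keys; the straight-through trick P̂ = sg(P̂ − P) + P only matters under autodiff and is noted in a comment):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def quantize_hard(z_k, codebooks):
    """Hard codeword selection per codebook (l2). Under autodiff, the
    straight-through estimator  P_hat = sg(P_hat - P) + P  would keep this
    discrete step differentiable; here we only show the forward pass."""
    M, L, sub_dim = codebooks.shape
    sub = z_k.reshape(M, sub_dim)
    parts = []
    for i in range(M):
        j_star = np.argmin(np.linalg.norm(codebooks[i] - sub[i], axis=1))
        parts.append(codebooks[i, j_star])
    return np.concatenate(parts)

def mcl(z_q, quantized_keys, target):
    """Multinoulli Contrastive Loss: -log P(z~_k | z_q), normalized over
    the quantized embeddings of the contrastive keys."""
    probs = softmax(quantized_keys @ z_q)
    return -np.log(probs[target])
```

As expected, the loss is smaller when the target key's quantized embedding is the one most aligned with the query.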

Approximating MCL with DCS
MCL calls for the normalization over all keys' quantized embeddings {z̃^k}, whose computation cost is huge. It has to be approximated by negative sampling (Bengio et al., 2013; Huang et al., 2020), where a subset of keys is encoded as the normalization term. Recent works (Gillick et al., 2019; Luan et al., 2020; Wu et al., 2020b; Karpukhin et al., 2020) use in-batch contrastive samples, which are free of extra encoding cost: for the i-th training instance within a mini-batch, the j-th key's quantized embedding (j ≠ i) is used as a contrastive sample. Thus, there are N − 1 cost-free contrastive samples in total (N: batch size). Recent studies also leverage cross-device in-batch sampling for sample augmentation in distributed environments (Yingqi et al., 2020). Particularly, a training instance on one device may take advantage of the quantized key embeddings on other devices as its contrastive samples. Thus, the contrastive samples are increased by a factor of D (D: the number of devices). A problem with cross-device in-batch sampling is that the shared embeddings from other devices are not differentiable (because the shared values need to be detached from their original computation graphs). As a result, the partial gradients cannot be back-propagated through the cross-device contrastive samples, which causes distortions in the model's update and thus undermines the optimization effect.
• DCS Method. We propose the Differentiable Cross-device in-batch Sampling (DCS), which enables the partial gradients to be back-propagated for the cross-device contrastive samples. The overall workflow is summarized in Alg. 1, where the core technique is referred to as the "combination of Primary and Image losses". Suppose a total of D GPU devices are deployed, each processing N training instances. The training instances encoded on the i-th device are denoted as {Q^i_{1...N}, K^i_{1...N}} (Q^i_j and K^i_j: the query embedding z^q and the quantized key embedding z̃^k of the j-th training instance on device i). The embeddings generated on each device are broadcasted to all the other devices. As a result, the i-th device maintains two sets of embeddings: 1) {Q^i_{1...N}, K^i_{1...N}}, which are locally encoded and differentiable, and 2) {Q^{≠i}_{1...N}, K^{≠i}_{1...N}}, which are broadcasted from other devices and thus non-differentiable. Each training instance (Q^i_j, K^i_j) has two losses computed in parallel: the primary loss on the device where it is encoded (i.e., Device-i), and the image losses on the devices where it is broadcasted.
Take the first instance on Device-1, (Q^1_1, K^1_1), for illustration (as Figure 3). The query-key matching probability P(K^1_1 | Q^1_1) on Device-1 is:

P(K^1_1 | Q^1_1) = exp⟨Q^1_1, K^1_1⟩ / ( Σ_j exp⟨Q^1_1, K^1_j⟩ + Σ_{i≠1} Σ_j exp⟨Q^1_1, K̄^i_j⟩ ).   (3)

The cross-device embeddings are detached; therefore, the partial gradients are stopped at these variables (marked as Q̄ and K̄). The above query-key matching probability is used by MCL (for "P(z̃^k | z^q)") in Eq. 2, whose result is called the primary loss w.r.t. (Q^1_1, K^1_1); the sum of primary losses for all (Q^1_*, K^1_*) is denoted as L^1_p. Meanwhile, the query-key matching probability P(K^1_1 | Q^1_1) is also computed on all other devices. On the i-th device (i ≠ 1), P(K^1_1 | Q^1_1) becomes:

P(K^1_1 | Q^1_1) = exp⟨Q̄^1_1, K̄^1_1⟩ / ( Σ_j exp⟨Q̄^1_1, K^i_j⟩ + Σ_{i'≠i} Σ_j exp⟨Q̄^1_1, K̄^{i'}_j⟩ ).   (4)

The differentiability is partially inverted compared with P(K^1_1 | Q^1_1) in Eq. 3: K^i_j becomes differentiable, while Q^1_1 and K^1_j become non-differentiable. The above probability is used to derive another MCL, which is called the image loss of (Q^1_1, K^1_1) on Device-i; the sum of image losses of all (Q^1_*, K^1_*) on Device-i is denoted as L^{1i}_c. Clearly, the image loss compensates the stopped gradients (related to K^i_j) in the primary loss. The primary losses and image losses are gathered from all GPU devices and added up, based on which the model parameters θ are updated w.r.t. the following partial gradients:

∂( Σ_i L^i_p + Σ_i Σ_{i'≠i} L^{i i'}_c ) / ∂θ.   (5)

It can be verified that the above results are equivalent to the partial gradients derived from the following fully differentiable distributions:

P(K^i_j | Q^i_j) = exp⟨Q^i_j, K^i_j⟩ / Σ_{i'} Σ_{j'} exp⟨Q^i_j, K^{i'}_{j'}⟩, ∀ i, j.   (6)

Thus, the partial gradients are free from the distortions caused by the non-differentiable variables, which enables MCL to be precisely approximated with the cross-device augmented contrastive samples.
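The gradient-compensation claim behind Eq. 5 and Eq. 6 can be checked on a miniature example. Below we differentiate the softmax matching loss w.r.t. the key embeddings analytically, zeroing the rows that would be detached: the primary view (only Device-0's keys live) plus the image view (only Device-1's keys live) recovers the full gradient. This is our own single-process construction, not the distributed implementation, and it checks only the key gradients (the query gradient flows through the primary loss alone):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def grad_wrt_keys(q, keys, target, live):
    """Analytic gradient of -log softmax(keys @ q)[target] w.r.t. keys,
    with rows outside `live` treated as detached (stop-gradient)."""
    p = softmax(keys @ q)
    g = (p - np.eye(len(keys))[target])[:, None] * q[None, :]
    g[~live] = 0.0              # gradients stop at the shared embeddings
    return g

rng = np.random.default_rng(0)
q = rng.normal(size=4)
keys = rng.normal(size=(4, 4))          # keys 0-1 on Device-0, 2-3 on Device-1
live0 = np.array([True, True, False, False])

full = grad_wrt_keys(q, keys, 0, np.ones(4, dtype=bool))   # Eq. 6 view
primary = grad_wrt_keys(q, keys, 0, live0)                 # Eq. 3 view
image = grad_wrt_keys(q, keys, 0, ~live0)                  # Eq. 4 view
```

`primary + image` matches `full` exactly, which is the sense in which the combined losses make the shared embeddings "virtually differentiable".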

Experiment Settings
• Data. We use three open datasets: Quora 6, with question pairs of duplicated meanings (Wang et al., 2019), where we use one question to retrieve its counterpart; News 7, with news articles from Microsoft News (Wu et al., 2020a), where we use the headline to retrieve the news body; and Wiki 8, with passages from Wikipedia, where we use the first sentence to retrieve the remaining part of the passage. We also use one large industrial dataset, Search Ads, with users' search queries and clicked ads from a worldwide search engine, where we use search queries to retrieve the titles of clicked ads. (See Table 2.)

• Baselines. We consider both supervised and non-supervised PQ as our baselines. For supervised PQ, the embedding model and the quantization model are learned jointly. For non-supervised PQ, the embedding model is learned first; then the quantization model is learned with fixed embeddings. We consider the following supervised methods: DQN, which is learned with two objectives: the embedding model is learned to match the query and key, and the quantization model is learned to minimize the reconstruction loss; DVSQ (Cao et al., 2017), which adapts the reconstruction loss to minimize the distortion of the query-key inner product; SPQ (Klein and Wolf, 2019), which minimizes the disagreement between the hard- and soft-allocated codewords; and DPQ, which still minimizes the reconstruction loss as DQN does, but leverages a different quantization module. We also include two non-supervised baselines: the vanilla PQ (Jégou et al., 2011) and OPQ (Ge et al., 2013). Although OPQ is non-supervised, it learns a transformation of the input embeddings such that the reconstruction loss can be minimized.

6. https://www.kaggle.com/c/quora-question-pairs
7. https://msnews.github.io
8. https://deepgraphlearning.github.io/project/wikidata5m

We implement two MoPQ alternatives: 1) MoPQ_b, the basic form with MCL and conventional in-batch sampling; 2) MoPQ_a, the advanced form with both MCL and DCS.
• Implementations. We use BERT-like Transformers (Devlin et al., 2018) for text encoding: the number of layers is 4 and the hidden dimension is 768. The input text is uncased and tokenized with WordPiece (Wu et al., 2016). The algorithms are implemented in PyTorch 1.8.0. We consider several codeword selection functions, including l2 (the default option), where the codeword is selected based on the Euclidean distance: b^i = one_hot(argmin_j ||z^i − C_ij||_2). Our code and data will be made publicly available. Pseudo-code for the algorithms, more comprehensive results, and implementation details are provided in the supplementary materials.

Experiment Analysis
The experimental studies focus on three major issues: 1) the overall comparisons between MoPQ and the existing PQ baselines; 2) the impact of DCS; 3) the impacts of codebook configurations, like codeword selection and codebook size.
• Overall Comparisons. The overall comparison results are shown in Table 3. The matching accuracy is measured by Recall@N (R@N). We use 8 codebooks by default, each of which has 256 codewords (M = 8, L = 256). The best performances are marked in bold; the most competitive baseline performances are underlined.
Firstly, it can be observed that the basic form MoPQ_b consistently outperforms all the baselines, with 7.8%∼19.5% relative improvements over the most competitive baselines on different datasets. With the enhancement of DCS, MoPQ_a further improves the performance by 11.0%∼42.7%, relatively. Both observations validate the effectiveness of our proposed methods. As for the baselines: the supervised PQ's performances are comparatively higher than the non-supervised ones; however, the improvements are not consistent: in some cases, OPQ achieves comparable or even higher recall rates than some of the supervised PQ methods.
Secondly, we further clarify the relationship between the reconstruction loss and the query-key matching accuracy. More results on the reconstruction loss are reported in Table 4. On one hand, we find that by jointly learning the embedding and quantization models, DQN's reconstruction losses become significantly lower than PQ's, and DQN's recall rates are consistently higher than PQ's. These observations indicate that, to some extent, the reduction of reconstruction loss may help improve PQ's retrieval accuracy. On the other hand, the reconstruction losses of MoPQ_b and MoPQ_a are much higher than DQN's, yet they dominate the baselines in terms of recall rate. These observations echo the theoretical and empirical findings in Section 2.2: PQ's query-key retrieval accuracy does not monotonically increase with the reduction of the reconstruction loss.
A brief conclusion for the above observations: although the minimization of reconstruction loss still contributes to PQ's retrieval accuracy, the improvement is limited due to the non-monotonic relationship between the two factors. In contrast, the minimization of MCL directly maximizes the query-key matching probability, which makes MoPQ achieve much more competitive retrieval accuracy.
• Impact of DCS. More analysis of DCS is shown in the upper part of Table 5. We consider two baselines: 1) the conventional in-batch sampling, where no cross-device sampling is made; 2) the Non-differentiable Cross-device Sampling (NCS), which also shares embeddings across GPU devices for contrastive sample augmentation, but computes no image loss to compensate the stopped gradients. It is observed that both DCS and NCS outperform the in-batch sampling, thanks to the augmentation of contrastive samples. However, the improvement of NCS is limited compared with DCS. As discussed in Section 3.2, the gradients are stopped at the non-differentiable variables of NCS, which causes distortions in the model's update and thus restricts the training outcome.
• Impact of codeword selection. We use MoPQ_b as the representative to analyze the impact of different codeword selection functions in the lower part of Table 5. We find that MoPQ's performance is not sensitive to the codeword selection function, as the experiment results are quite close to each other. Given that the l2 selection's performance is slightly better and no extra parameters are introduced, it is used as our default choice.
• Impact of codebook size. We analyze the impact of codebook size in Table 6; the SPQ baseline is included for comparison. With the expansion of scale, i.e., more codebooks and more codewords per codebook, MoPQ's performance improves gradually. In all of the settings, MoPQ maintains its advantage over SPQ. It is also observed that in some cases, MoPQ outperforms SPQ with an even smaller codebook size, e.g., MoPQ (M=4, L=256) vs. SPQ (M=8, L=256) on News; in other words, a higher recall rate is achieved with smaller space and time costs.

Conclusion
In this paper, we propose MoPQ to optimize PQ's ad-hoc retrieval accuracy. We systematically revisit the existing supervised PQ and identify the limitation of using reconstruction loss minimization as the training objective. We propose MCL as the new training objective, with which the model learns to maximize the query-key matching probability and thus achieve the optimal retrieval accuracy. We further leverage DCS for contrastive sample augmentation, which ensures the effective minimization of MCL. The experiment results on four real-world datasets validate the effectiveness of our proposed methods.