Simple and Effective Unsupervised Redundancy Elimination to Compress Dense Vectors for Passage Retrieval

Recent work has shown that dense passage retrieval techniques achieve better ranking accuracy in open-domain question answering than sparse retrieval techniques such as BM25, but at the cost of large space and memory requirements. In this paper, we analyze the redundancy present in encoded dense vectors and show that the default dimension of 768 is unnecessarily large. To improve space efficiency, we propose a simple unsupervised compression pipeline that consists of principal component analysis (PCA), product quantization, and hybrid search. We further investigate other supervised baselines and find, surprisingly, that unsupervised PCA outperforms them in some settings. We perform extensive experiments on five question answering datasets and demonstrate that our best pipeline achieves good accuracy-space trade-offs, for example, 48× compression with less than a 3% drop in top-100 retrieval accuracy on average, or 96× compression with less than a 4% drop. Code and data are available at http://pyserini.io/.


Introduction
Dense passage retrieval (DPR; Karpukhin et al., 2020) improves end-to-end retrieval accuracy in open-domain question answering (QA) by representing queries and documents in a low-dimensional, dense vector space. However, the vastly increased space and memory demands of storing and loading the dense vectors call for effective compression methods (Izacard et al., 2020). For example, the original DPR vector (flat) index on the Wikipedia corpus occupies about 61 GB, while its sparse counterpart, the BM25 inverted index, uses only 2.4 GB. This staggering increase of around 25× in space requirements yields an average gain of only 2.5% in top-100 accuracy across five datasets, indicating potential redundancy in the dense representations.
In this work, we quantify the redundancy within the dense vectors encoded by the DPR model using the explained variance ratio and mutual information. Figure 1 shows that the original 768 dimensions are unnecessarily large, as both the mutual information and the explained variance ratio start to plateau at around 256 dimensions. Based on this observation, we propose a simple yet effective pipeline for dense retrieval compression that combines principal component analysis (PCA), product quantization (PQ), and hybrid search to reduce index size while retaining effectiveness.
We also compare other compression options, including supervised dimensionality reduction, where we fine-tune a linear projection layer on top of the pre-trained DPR model using relevance labels. Surprisingly, we find that PCA achieves top-100 retrieval accuracy better than the two supervised dimensionality reduction techniques at 256 and 128 dimensions, while the supervised techniques outperform (unsupervised) PCA at 64 dimensions. Our techniques support different accuracy-space trade-offs; one sweet spot compresses the dense vectors by 96× with less than a 4% drop in top-100 accuracy on average across five standard QA datasets. Finally, we combine our pipeline with a BM25 inverted index for sparse-dense hybrid search, achieving 16× compression without any accuracy drop compared to the original DPR results.

Background and Related Work
Sparse retrieval methods such as BM25 (Robertson and Zaragoza, 2009; Yang et al., 2017) have established strong baselines in open-domain QA (Chen et al., 2017; Yang et al., 2019). Recently, dense retrieval has emerged as a promising alternative (Karpukhin et al., 2020; Zhan et al., 2020; Xiong et al., 2021; Hofstätter et al., 2020; Lin et al., 2020) for end-to-end question answering, but at the cost of increased space requirements. Efforts have been made toward developing memory-efficient baselines (Izacard et al., 2020), but the topic remains under-explored. In the following, we briefly introduce how dense retrieval works during training and inference.
Given a collection of passages and a QA task, DPR (Karpukhin et al., 2020) adopts a bi-encoder architecture in which the encoders $f_Q(\cdot)$ and $f_D(\cdot)$ are independent BERT (Devlin et al., 2019) models that encode questions and passages into dense vectors. The relevance between a question $q$ and a passage $d$ is defined by the dot product between their corresponding vectors:

$$\mathrm{sim}(q, d) = f_Q(q)^\top f_D(d).$$

The relevance score is used to rank the passages during retrieval with nearest neighbor search techniques. During training, given a question $q$, a positive passage $d^+$ that contains the answer for $q$, and $m$ negative passages $d^-_1, d^-_2, \ldots, d^-_m$, the training objective is the negative log-likelihood of the positive passage:

$$L = -\log p(D = d^+ \mid Q = q) = -\log \frac{\exp\big(\mathrm{sim}(q, d^+)\big)}{\exp\big(\mathrm{sim}(q, d^+)\big) + \sum_{j=1}^{m} \exp\big(\mathrm{sim}(q, d^-_j)\big)}, \quad (1)$$

where $p(D = d^+ \mid Q = q)$ can be seen as a classifier given the question $q$ and evaluated at $d^+$. Normally, the [CLS] output of the BERT model is used as the dense representation, and its default dimension is 768.
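To make the objective concrete, here is a minimal PyTorch sketch of the scoring function and the negative log-likelihood loss in Eq. (1); the function and tensor names are our own, and the random vectors stand in for actual BERT [CLS] outputs.

```python
import torch
import torch.nn.functional as F

def dpr_loss(q_vec, pos_vec, neg_vecs):
    """Negative log-likelihood of the positive passage, as in Eq. (1).

    q_vec:    (d,)   question embedding f_Q(q)
    pos_vec:  (d,)   positive passage embedding f_D(d+)
    neg_vecs: (m, d) negative passage embeddings f_D(d_j-)
    """
    scores = torch.cat([
        (q_vec * pos_vec).sum().unsqueeze(0),  # sim(q, d+)
        neg_vecs @ q_vec,                      # sim(q, d_j-) for each negative
    ])
    # -log softmax at index 0 is exactly -log p(D = d+ | Q = q).
    return -F.log_softmax(scores, dim=0)[0]

# Toy usage with random stand-ins at the default dimension of 768.
q, pos, negs = torch.randn(768), torch.randn(768), torch.randn(7, 768)
print(dpr_loss(q, pos, negs))
```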

Compressing Dense Representations
As mentioned above, DPR's encoders produce dense vectors of 768 dimensions by default, which, as we show below, is unnecessarily large. In this section, we discuss how to quantify the redundancy in the encoded vectors and how to improve DPR's space efficiency by reducing this redundancy.

Quantifying Redundancy
We use two metrics to quantify the redundancy in dense vectors: the explained variance ratio of the principal components and the mutual information between the question vectors and the passage vectors. The explained variance ratio of PCA is defined as:

$$r(m) = \frac{\sum_{i=1}^{m} \sigma_i^2}{\sum_{i=1}^{n} \sigma_i^2},$$

where $\sigma_i^2$ is the variance corresponding to the $i$-th largest eigenvalue, $n$ is the original dimension, and $m$ is the reduced dimension. This ratio tells us how much variance is retained by preserving the first $m$ eigenvectors of the dense representations.

Another way to evaluate redundancy is the mutual information between the dense representations of the questions $Q$ and the retrieved passages $D$, which can be approximated using the classifier in Eq. (1):

$$I(Q; D) \approx \frac{1}{N} \sum_{i=1}^{N} \log \big( N \cdot p(D = d^+_i \mid Q = q_i) \big),$$

where $\{q_i\}_{i=1}^{N}$ is the set of training/dev/test questions and $d^+_i$ is the positive passage for $q_i$. This quantity is upper-bounded by $\ln N$, and the normalized mutual information is $I(Q; D) / \ln N$.

Figure 1 shows the explained variance and mutual information of compressed vectors at different dimensions reduced by PCA. As we can see, ∼90% of the variance and ∼99% of the mutual information are retained by the first ∼200 dimensions. However, as the dimension decreases further, useful information is discarded at a higher rate and the dense representation starts to degrade visibly. The figure illustrates that a dimension of ∼200 could be a sweet spot in accuracy-space trade-offs. Our later experiments confirm that a dimension of 256 indeed achieves the best balance among the choices we consider, as shown in Figure 2 and discussed in Section 5.3.
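Both diagnostics are straightforward to compute from a set of embeddings. Below is a small sketch using NumPy, SciPy, and scikit-learn; the random arrays stand in for actual DPR embeddings, and the mutual information estimator assumes an InfoNCE-style approximation in which every other positive passage serves as a negative for a given question, consistent with the $\ln N$ upper bound above.

```python
import numpy as np
from scipy.special import logsumexp
from sklearn.decomposition import PCA

# Stand-in for stacked question + passage embeddings, shape (num_vectors, 768).
X = np.random.randn(10_000, 768).astype(np.float32)

pca = PCA(n_components=768).fit(X)
# Cumulative explained variance ratio r(m) for m = 1..768.
r = np.cumsum(pca.explained_variance_ratio_)
print("variance retained by the first 256 dims:", r[255])

def mutual_information(q_vecs, p_vecs):
    """InfoNCE-style estimate of I(Q; D) over N aligned (q_i, d_i+) pairs.

    Every other positive passage acts as a negative for q_i, so the
    estimate is upper-bounded by ln(N); divide by ln(N) to normalize.
    """
    n = q_vecs.shape[0]
    scores = q_vecs @ p_vecs.T                            # (N, N) dot products
    log_p = scores - logsumexp(scores, axis=1, keepdims=True)  # row log-softmax
    return float(np.mean(np.log(n) + np.diag(log_p)))

# Toy usage: 1000 aligned question/passage pairs in a reduced space.
q = np.random.randn(1000, 256).astype(np.float32)
p = np.random.randn(1000, 256).astype(np.float32)
print("normalized MI:", mutual_information(q, p) / np.log(1000))
```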

Dense Vector Compression
We explore three different types of dimensionality reduction techniques given dense representations from a pre-trained DPR model.

Supervised Approach We apply a linear transformation to the pre-trained dense vector representations and fine-tune this linear layer with relevance labels according to Eq. (1). Independent linear layers $W_q$ and $W_p$ are added to the question encoder and the passage encoder, respectively. To make this compression technique a plug-and-play component, we optimize only the linear layers while freezing the rest of the network. In addition, we add an orthogonality regularization term $\|W_p W_q^\top - I\|_2$ to the original loss function in Eq. (1), where $I$ is the identity matrix. This regularization encourages the weights in $W_p$ and $W_q$ to be orthogonal while retaining most of the information in the original dense vectors.
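A minimal sketch of this supervised head is given below, assuming the backbone DPR embeddings are precomputed so that only the projection layers receive gradients; the class name and the regularization weight (0.1) are our own choices, and we write the penalty with a transpose to match PyTorch's row-major weight layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearCompressor(nn.Module):
    """Projection heads on top of a frozen DPR bi-encoder (hypothetical name)."""

    def __init__(self, in_dim: int = 768, out_dim: int = 256):
        super().__init__()
        self.w_q = nn.Linear(in_dim, out_dim, bias=False)  # question head W_q
        self.w_p = nn.Linear(in_dim, out_dim, bias=False)  # passage head W_p

    def forward(self, q_vecs, p_vecs):
        return self.w_q(q_vecs), self.w_p(p_vecs)

    def ortho_penalty(self):
        # Norm of W_p W_q^T - I, pushing the heads toward orthogonal,
        # information-preserving projections.
        prod = self.w_p.weight @ self.w_q.weight.T         # (out_dim, out_dim)
        eye = torch.eye(prod.shape[0], device=prod.device)
        return torch.norm(prod - eye)

# Training sketch: DPR embeddings are precomputed (backbone frozen), so
# only the two heads receive gradients; the 0.1 weight is an assumption.
model = LinearCompressor(768, 256)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

q, pos, negs = torch.randn(768), torch.randn(768), torch.randn(7, 768)
cq, cp = model(q.unsqueeze(0), torch.cat([pos.unsqueeze(0), negs]))
scores = cp @ cq.squeeze(0)                    # compressed dot products
loss = -F.log_softmax(scores, dim=0)[0] + 0.1 * model.ortho_penalty()
loss.backward()
opt.step()
```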
Unsupervised Approach A popular technique, principal component analysis (PCA), can effectively reduce the dimensionality of high-dimensional vectors while retaining most of the variance in the original representation. We fit a linear PCA transformation on the combination of question and passage vectors to learn a compressed representation based on a pre-trained DPR model. During inference, the same transformation is applied to each question, and the relevance score is the dot product between the compressed question and passage vectors.
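A sketch of this unsupervised pipeline with scikit-learn, assuming precomputed DPR-768 embeddings; the array sizes are placeholders that mirror the setup described later (training-set question embeddings plus 160k sampled passage embeddings):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-ins for real DPR-768 outputs.
question_vecs = np.random.randn(60_000, 768).astype(np.float32)
passage_sample = np.random.randn(160_000, 768).astype(np.float32)

# Fit the linear PCA transformation on questions and passages together.
pca = PCA(n_components=256)
pca.fit(np.vstack([question_vecs, passage_sample]))

# At indexing time, compress every passage vector once; at query time,
# apply the same transformation to the question and rank by dot product.
compressed_passages = pca.transform(passage_sample)
compressed_query = pca.transform(question_vecs[:1])
scores = compressed_passages @ compressed_query[0]
top100 = np.argsort(-scores)[:100]
```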
Product Quantization On top of the supervised and unsupervised dimensionality reduction techniques described above, we further leverage product quantization (PQ), which decomposes the original d-dimensional vector into s sub-vectors. Each sub-vector is quantized using k-means and eventually stored with t bits (Jégou et al., 2011). For example, an original 768-dimensional dense vector occupies 768 × 32 bits. By dividing it into 192 sub-vectors of 8 bits each, the storage drops to 192 × 8 bits, 1/16 of the original size; on average, space is reduced from 32 bits to 2 bits per dimension.
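The following Faiss sketch mirrors the example above: 768-dimensional vectors split into 192 sub-vectors of 8 bits each, i.e., 2 bits per dimension on average; the random vectors stand in for actual passage embeddings.

```python
import faiss
import numpy as np

d, n_sub, n_bits = 768, 192, 8                 # 192 sub-vectors x 8 bits
index = faiss.IndexPQ(d, n_sub, n_bits, faiss.METRIC_INNER_PRODUCT)

xb = np.random.randn(100_000, d).astype(np.float32)
index.train(xb)                                # learn 256 codewords per sub-vector
index.add(xb)                                  # store 1536 bits per vector

xq = np.random.randn(4, d).astype(np.float32)
scores, ids = index.search(xq, 100)            # approximate top-100 search
# 192 * 8 = 1536 bits per vector vs. 768 * 32 = 24576 bits uncompressed,
# a 16x reduction: from 32 bits to 2 bits per dimension on average.
```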

Experimental Setup
Datasets and Metrics We evaluate the top-k retrieval accuracy of our compression methods on five QA datasets examined in the original DPR paper (Karpukhin et al., 2020): NQ, TriviaQA, WQ, CuratedTREC, and SQuAD. The top-k retrieval accuracy is defined as the fraction of questions that have at least one correct answer span in the top-k retrieved passages. Following previous work, we use k ∈ {20, 100}. We use the combination of NQ, TriviaQA, WQ, and CuratedTREC to train our models, following the same setting as DPR.
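For concreteness, the metric can be computed as below; `has_answer` is a hypothetical span-matching helper (e.g., the answer-checking utilities of an IR toolkit), not something defined in this paper.

```python
def top_k_accuracy(retrieved, answers, k=100, has_answer=None):
    """Fraction of questions with >= 1 correct answer span in the top-k.

    retrieved:  list (one entry per question) of ranked passage texts
    answers:    list of gold answer-string lists, aligned with `retrieved`
    has_answer: hypothetical span-matching helper supplied by the caller
    """
    hits = sum(
        any(has_answer(passage, golds) for passage in passages[:k])
        for passages, golds in zip(retrieved, answers)
    )
    return hits / len(retrieved)
```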
Model Training For DPR, instead of the original Facebook implementation, we use the implementation of Gao et al. (2021), which takes advantage of gradient caching to save GPU memory and mixed-precision training to speed up learning. We find that using a learning rate of $10^{-6}$ and training the model for 40 epochs achieves better effectiveness than the default DPR setting. We refer to the original DPR model, which outputs 768-dimensional vectors, as DPR-768. Other hyperparameters are identical to the default DPR settings (Karpukhin et al., 2020).
For the compression methods, we consider reduced dimensions $d \in \{256, 128, 64\}$ according to Figure 1. For the supervised approach, the linear layer is trained for one epoch with a learning rate of $10^{-3}$ while freezing the backbone DPR model (i.e., BERT); we refer to these models as Linear-d. For comparison, we train models with architecture identical to Linear-d but without freezing the BERT model, and refer to them as DPR-d.
For the unsupervised PCA approach, we fit the PCA transformation with question and passage embeddings produced by the original DPR-768 model. The question embeddings are those of the questions in the training set. A total of 160k passage embeddings are randomly sampled from the passage embeddings of the entire corpus (i.e., the original dense index).
We utilize the PQ implementation in Faiss (Johnson et al., 2021). The number of codewords for each sub-vector is fixed at 256 (i.e., 8 bits). Following Izacard et al. (2020), we vary the number of sub-vectors for the dense embeddings so that the average memory per dimension is reduced. Combined with the dimensionality reduction methods, we decrease the space occupied by each dimension from 32 bits to 1 or 2 bits.

Hybrid Search Prior work has shown that hybrid search significantly improves the retrieval accuracy of DPR. We therefore fuse DPR's and BM25's retrieval results following the same rule, where the final hybrid score is calculated as $\mathrm{score}_{\mathrm{dense}} + \alpha \cdot \mathrm{score}_{\mathrm{sparse}}$. We set $\alpha = 1$, weighting dense and sparse scores equally in all hybrid search cases, as this is a good default for hybrid search in general.
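A sketch of this fusion rule over the union of candidates from the two retrievers is shown below; how passages missing from one ranked list are scored is our own assumption for this sketch (we fall back to that list's minimum score), not a detail specified here.

```python
def hybrid_fuse(dense_hits, sparse_hits, alpha=1.0, k=100):
    """Fuse dense and sparse rankings: score_dense + alpha * score_sparse.

    dense_hits / sparse_hits: dict mapping passage id -> retriever score.
    Passages missing from one list fall back to that list's minimum
    score (an assumption of this sketch).
    """
    dense_floor = min(dense_hits.values())
    sparse_floor = min(sparse_hits.values())
    fused = {
        pid: dense_hits.get(pid, dense_floor)
             + alpha * sparse_hits.get(pid, sparse_floor)
        for pid in set(dense_hits) | set(sparse_hits)
    }
    return sorted(fused.items(), key=lambda kv: -kv[1])[:k]

# Toy usage with overlapping candidate sets from the two retrievers.
print(hybrid_fuse({"p1": 80.2, "p2": 75.9}, {"p2": 11.3, "p3": 9.8}, k=3))
```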

Results
In this section, we characterize the trade-offs between retrieval accuracy and space requirements using different combinations of our proposed techniques. All experiments are implemented with the Pyserini IR toolkit.

Table 1 shows the top-{20, 100} accuracy of the three dimensionality reduction techniques presented in Section 3.2, evaluated at $d \in \{256, 128, 64\}$ on five benchmark QA datasets. DPR-768 achieves the best accuracy, which serves as the upper bound for compression. Overall, top-100 accuracy decreases as we reduce the dense vectors to fewer dimensions. However, PCA-256 sees only a 0.6∼1.5% drop in accuracy on {NQ, TriviaQA, WQ, CuratedTREC} while achieving 3× compression. Model quality degrades more on SQuAD, since our models are trained on the combination of the other four datasets.

We further evaluate retrieval latency at different dimensionalities in Table 1. Although latency is not the focus of this work, it is worth noting that dimensionality reduction also reduces retrieval latency, as it speeds up dot-product calculations; for example, reducing the dimensionality from 768 to 256 cuts retrieval latency by roughly a factor of three. Latency is measured as query encoding time plus brute-force retrieval time on a machine with an Intel Xeon Platinum 8160 2.10 GHz CPU using Faiss FlatIP indexes. Both query encoding and retrieval are performed with a single CPU thread.
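A minimal sketch of this latency measurement setup, assuming Faiss with a brute-force FlatIP index restricted to a single thread; the corpus size and timing output are illustrative only.

```python
import time

import faiss
import numpy as np

faiss.omp_set_num_threads(1)                   # single CPU thread

d = 256                                        # reduced dimension
index = faiss.IndexFlatIP(d)                   # exact dot-product search
index.add(np.random.randn(100_000, d).astype(np.float32))

query = np.random.randn(1, d).astype(np.float32)
start = time.perf_counter()
index.search(query, 100)                       # brute-force top-100
print(f"retrieval time: {time.perf_counter() - start:.4f}s")
# End-to-end latency also includes BERT query encoding (not shown here).
```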

Dimensionality Reduction
Across different dimensions, the unsupervised PCA method often offers the best trade-off between accuracy and compression rate. It is surprising that unsupervised PCA outperforms the two supervised methods at dimensions 256 and 128. Although the supervised methods might be further improved with carefully tuned hyperparameters, training DPR is computationally expensive, while PCA is the more robust and economical way to achieve comparable results. Another surprising finding is that Linear-d generally outperforms DPR-d: freezing the backbone DPR model and fine-tuning only the linear layer seems to work better than training the entire model end-to-end to generate compressed representations. However, this finding may simply be due to suboptimal hyperparameter selection.

Product Quantization
Product quantization (PQ) can often aggressively reduce the index size while largely preserving retrieval effectiveness. For example, PQ2 (meaning each dimension occupies 2 bits on average after PQ) reduces the size of DPR-768's vectors by 16× with only a 0.7% loss in top-100 retrieval accuracy on average. We also report the retrieval accuracy of our dimensionality reduction methods under different PQ settings. Combined with product quantization, the dimensionality reduction methods achieve a much higher compression rate with, in some cases, only a modest loss in retrieval accuracy. In addition, we find that PQ2 outperforms PQ1 on most datasets and dimensions, as PQ1 suffers more than twice the accuracy loss of PQ2. It appears that compression to 1 bit per dimension is too aggressive and represents a poor trade-off. If we restrict the average drop in retrieval accuracy to within 4%, we can compress the dense vectors by up to 96×, reducing the original DPR index on the Wikipedia corpus from 61 GB to mere hundreds of MB.

Figure 2 shows the trade-off between retrieval accuracy and index size for different combinations of dimensionality reduction, product quantization, and hybrid search. On each curve, the points from left to right represent PQ1, PQ2, and no PQ, respectively. Sparse retrieval with the BM25 inverted index is shown as the black triangle. The dashed lines represent sparse-dense hybrid retrieval; these lines include the size of the BM25 inverted index. In the plot, up and to the left is better: higher accuracy and smaller indexes.

Hybrid Search
As an example, the pipeline consisting of PCA-256, PQ2, and hybrid search reduces the total index size from 61 GB to 3.7 GB (a reduction of 57 GB, or roughly 16×) and even yields a 0.2% gain in top-100 accuracy compared to DPR-768. The dotted black line in Figure 2 shows the Pareto frontier, which can be understood as the best achievable accuracy under a given restriction on index size. Overall, the DPR-768 (orange) line does not lie on the frontier, which means that some combination of our techniques is strictly more accurate and smaller than the original DPR representations.

Conclusions
This paper analyzes the redundancy within the dense representations produced by DPR, a popular dense retrieval model. We propose a simple yet effective compression pipeline that enables trade-offs between space and accuracy, drastically reducing index size while preserving end-to-end retrieval accuracy at reasonable levels.