The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes

Information retrieval using dense low-dimensional representations has recently become popular and has been shown to outperform traditional sparse representations like BM25. However, no previous work has investigated how dense representations perform with large index sizes. We show theoretically and empirically that the performance of dense representations decreases more quickly than that of sparse representations as the index size increases. In extreme cases, this can even lead to a tipping point where, at a certain index size, sparse representations outperform dense representations. We show that this behavior is tightly connected to the number of dimensions of the representations: the lower the dimensionality, the higher the chance of false positives, i.e., of returning irrelevant documents.


Introduction
Information retrieval traditionally used sparse representations like TF-IDF or BM25 to retrieve relevant documents for a given query. However, these approaches suffer from the lexical gap problem (Berger et al., 2000).
To overcome this issue, dense representations have been proposed (Gillick et al., 2018): Queries and documents are mapped to a dense vector space, and relevant documents are retrieved, e.g., using cosine similarity. Out-performance over sparse lexical approaches has been shown for various datasets (Gillick et al., 2018; Guo et al., 2020; Guu et al., 2020; Gao et al., 2020).
Previous work showed this out-performance for fixed, rather small indexes. The largest dataset where it has been shown is the MS MARCO (Bajaj et al., 2018) passage retrieval dataset, where retrieval is done over an index of 8.8 million text passages. However, in production scenarios, index sizes quickly reach 100 million documents or more.
We show in this paper that the performance of dense representations can decrease more quickly with increasing index size than that of sparse representations. For a small index of, e.g., 100k documents, a dense approach might clearly outperform sparse approaches. However, with a larger index of several million documents, the sparse approach can outperform the dense approach.
We show theoretically and empirically that this effect is closely linked to the number of dimensions for the representations: Using fewer dimensions increases the chances for false positives. This effect becomes more severe with increasing index sizes.

Related Work
A common choice for dense retrieval is to fine-tune a transformer network like BERT (Devlin et al., 2018) on a given training corpus with queries and relevant documents (Guo et al., 2020; Guu et al., 2020; Gao et al., 2020; Karpukhin et al., 2020; Luan et al., 2020). Recent work showed that combining dense approaches with sparse, lexical approaches can further boost the performance (Luan et al., 2020; Gao et al., 2020). While these approaches have been tested on various information retrieval and question answering datasets, the performance was only evaluated on fixed, rather small indexes. Guo et al. (2020) evaluated approaches for eight different datasets with index sizes between 3k and 454k documents.
We are not aware of previous work that compares sparse and dense approaches for increasing index sizes and the connection to the dimensionality. The only work we are aware of that systematically studies the encoding size for dense approaches is (Luan et al., 2020), but they only studied the connection to the document length.

Theory
Dense retrieval approaches map queries and documents (we use document as a cover-term for text of any length) to a fixed-size dense vector. The most relevant documents for a given query can then be found using cosine similarity.
Using as few dimensions as possible is desirable, as it decreases the memory required to store an index of millions of vectors and leads to faster retrieval. However, as we show, lower-dimensional representations can have issues with large indexes.
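To make the memory argument concrete, a rough back-of-the-envelope sketch (assuming float32 vectors; the helper name is ours):

```python
def index_memory_gb(n_docs: int, dim: int, bytes_per_value: int = 4) -> float:
    """Approximate memory needed to store n_docs dense vectors of
    dimensionality dim as float32 (4 bytes per value), in GiB."""
    return n_docs * dim * bytes_per_value / 1024**3

# E.g., the 8.8M MS MARCO passages at 768 vs. 128 dimensions:
full = index_memory_gb(8_800_000, 768)   # roughly 25 GiB
small = index_memory_gb(8_800_000, 128)  # 6x smaller, roughly 4 GiB
```

Halving or quartering the dimensionality reduces both memory and similarity-computation cost by the same factor, which explains the appeal of low-dimensional representations.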
Given a query vector q ∈ R^k, we search our index of document vectors d_1, ..., d_n ∈ R^k for the document that maximizes:

argmax_{i=1,...,n} cossim(q, d_i)

Note: In the following we only show the case for cosine similarity. The proof extends to other similarity functions like the dot-product and any p-norm (Manhattan, Euclidean) as long as the vector space is finite. A finite n-dimensional vector space can be mapped to an (n+1)-dimensional vector space with vectors of unit length. In that case, the dot-product in n dimensions is equivalent to cosine similarity in n + 1 dimensions. Similarly, any p-norm in n dimensions can be re-written as cosine similarity in n + 1 dimensions.
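This retrieval step can be sketched with NumPy as follows (exact search; the function name is ours):

```python
import numpy as np

def retrieve_top_k(q, index, k=10):
    """Return the indices of the k documents with the highest cosine
    similarity to query q. q: (dim,) vector, index: (n, dim) matrix.
    After L2 normalization, the dot product equals cosine similarity."""
    q = q / np.linalg.norm(q)
    docs = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = docs @ q                 # cosine similarities to all documents
    return np.argsort(-scores)[:k]    # best-scoring documents first
```

In practice, large indexes use approximate nearest-neighbor search instead of this brute-force scan, but the ranking criterion is the same.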
Theorem: The probability for false positives (I) increases with the index size n and (II) with the decreasing dimensionality k.
Proof (I): Given a query q and the relevant document d_r. For simplicity, we assume there is only a single relevant document; if multiple documents are relevant, we consider only the one with the highest cosine similarity. In order that no false positive is returned, cossim(q, d_r) must be greater than cossim(q, d_i) for all i ≠ r. Assume the possible vectors are independent. Then, the probability of a false positive is

P(false positive) = 1 − (1 − P(false positive_i))^(n−1)     (1)

for an index with n − 1 negative elements, with P(false positive_i) the probability that a single element is a false positive, i.e., cossim(q, d_i) > cossim(q, d_r). This probability is monotonically increasing in the index size n.
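The compounding in Proof (I) can be evaluated directly: even a tiny per-document false-positive probability grows toward 1 as the index size n increases (a sketch; the helper name is ours):

```python
def p_any_false_positive(p_single: float, n: int) -> float:
    """Probability that at least one of the n - 1 independent negative
    documents scores higher than the relevant one:
    1 - (1 - p_single)^(n - 1)."""
    return 1.0 - (1.0 - p_single) ** (n - 1)
```

For example, with p_single = 1e-6 the probability of at least one false positive is about 0.1% for an index of 1k documents, but about 63% for an index of 1 million documents.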
Proof (II): While the previous proof, that the chance of false positives increases with larger index sizes, is straightforward, the more interesting aspect is the relation to the dimensionality: what is the probability P(false positive_i) = P(cossim(q, d_i) > cossim(q, d_r)) for a random d_i? We show that this probability decreases with more dimensions.
Without loss of generality, we assume that the vectors are of unit length. The vectors then lie on a k-dimensional sphere with radius 1. A false positive occurs if cossim(q, d_i) ≥ cossim(q, d_r), i.e., we intersect the sphere in k dimensions with a hyperplane in k − 1 dimensions. The area of the cut-off portion is defined by 1 − cossim(q, d_r). All vectors within the cut-off portion (i.e., the spherical cap) are false positives. The probability that a random vector will be returned as a false positive is:

P(false positive_i) = A_cap / A_sphere

with A_cap the surface area of the spherical cap and A_sphere the surface area of the sphere in k dimensions. Denoting the surface area of the sphere in k dimensions by A_k, the surface area of the cap is (Li, 2011):

A_cap = (1/2) A_k I_{sin²θ}((k−1)/2, 1/2)

with I_x(a, b) the regularized incomplete beta function and θ the polar angle, i.e., the angle between q and the relevant document d_r. Hence:

P(false positive_i) = (1/2) I_{sin²θ}((k−1)/2, 1/2)

For a constant cosine similarity between query q and relevant document d_r, I_{sin²θ}((k−1)/2, 1/2) is a monotonically decreasing function of the dimension k. In conclusion, more dimensions decrease the probability of false positives.
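The cap probability can be computed numerically with SciPy's regularized incomplete beta function (a sketch of the final formula; assumes SciPy is available, and the function name is ours):

```python
from scipy.special import betainc  # betainc(a, b, x) = I_x(a, b)

def p_false_positive_single(cos_sim: float, k: int) -> float:
    """P(cossim(q, d_i) > cossim(q, d_r)) for a random unit vector d_i
    in k dimensions: (1/2) * I_{sin^2(theta)}((k-1)/2, 1/2),
    where cos(theta) = cossim(q, d_r)."""
    sin_sq = 1.0 - cos_sim ** 2
    return 0.5 * betainc((k - 1) / 2.0, 0.5, sin_sq)
```

For cossim(q, d_r) = 0 the cap is a half-sphere and the probability is exactly 0.5 regardless of k; for any positive similarity, the probability decays rapidly as k grows, matching the monotonicity claim above.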
Combining (I) and (II) shows that a low dimensional representation might work well for small index sizes. However, with more indexed documents, the probability of false positives increases faster for low dimensional representations than for higher dimensional representations. Hence, at some index size, higher dimensional representations might outperform the lower-dimensional representation.

Empirical Investigation
In the proof, we have assumed that the vectors are independent and uniformly distributed over the space, which gives a lower bound on the false positive rate. However, in practice, dense representations are neither independent nor uniformly distributed. As shown by Ethayarajh (2019) and Li et al. (2020), dense representations derived from pre-trained Transformers like BERT map to an anisotropic space, i.e., the vectors occupy only a narrow cone of the vector space. This drastically increases the chance that an irrelevant document is closer to the query embedding than the relevant document. Hence, we study how actual dense models are impacted by increasing index sizes and lower-dimensional representations.

Dataset
We conduct our experiments on the MS MARCO passage dataset (Bajaj et al., 2018). It consists of over 1 million unique real queries from the Bing search engine, together with 8.8 million paragraphs from heterogeneous web sources. Most queries have only 1 passage judged as relevant, even though more can exist. The development set consists of 6980 queries, and the performance is evaluated using mean reciprocal rank (MRR@10).
To better compare the relative performance differences, we compute a rank-aware error rate for each query:

Err_i = 1 − 1/rank_i

with rank_i the rank of the relevant document for the i-th query. To be compatible with MRR@10, we set rank_i = ∞ for rank_i > 10. We then define the relative error rate as Err_Dense / Err_BM25. A relative error rate of 50% indicates that the dense approach makes only 50% of the errors compared to BM25 retrieval.
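Both metrics are straightforward to compute from the per-query ranks (a sketch; the function names are ours):

```python
def mrr_at_10(ranks):
    """MRR@10: mean reciprocal rank, with ranks > 10 contributing 0."""
    return sum(1.0 / r if r <= 10 else 0.0 for r in ranks) / len(ranks)

def relative_error_rate(ranks_dense, ranks_bm25):
    """Mean of Err_i = 1 - 1/rank_i (Err_i = 1 for rank_i > 10),
    reported as the ratio Err_Dense / Err_BM25."""
    def err(ranks):
        return sum(1.0 - 1.0 / r if r <= 10 else 1.0 for r in ranks) / len(ranks)
    return err(ranks_dense) / err(ranks_bm25)
```

For instance, ranks [1, 1, 2] against BM25 ranks [2, 2, 2] give a relative error rate of 1/3, i.e., the dense system makes a third of BM25's errors.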

Model
For sparse, lexical retrieval, we use ElasticSearch, which is based on BM25. For dense retrieval, we use a DistilRoBERTa-base model (Sanh et al., 2020) as a bi-encoder: the query and the passage are passed independently through the transformer model, and the output is mean-pooled to create fixed-sized representations. We train this using the InfoNCE loss (van den Oord et al., 2018):

L = −log( exp(τ · cossim(q, p+)) / Σ_i exp(τ · cossim(q, p_i)) )

with q the query and p+ the relevant passage. We use in-batch negative sampling: the other passages in a batch serve as negative examples. We found that τ = 20 performs well. We train the model in two setups: 1) only with random (in-batch) negatives, and 2) additionally providing one hard-negative passage for each query. We use the hard-negative passages provided by the MS MARCO dataset, which were retrieved using lexical search. Models are trained with a batch size of 128, the Adam optimizer, and a learning rate of 2e-5.
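The in-batch-negatives objective can be sketched as follows (NumPy for illustration only; the actual training uses a deep learning framework with gradient computation):

```python
import numpy as np

def info_nce_loss(q_emb, p_emb, tau=20.0):
    """InfoNCE with in-batch negatives: for query i, passage i is the
    positive and the other passages in the batch are the negatives.
    q_emb, p_emb: (batch, dim) query / passage embeddings."""
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    p = p_emb / np.linalg.norm(p_emb, axis=1, keepdims=True)
    scores = tau * (q @ p.T)                       # scaled cosine similarities
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))   # -log P(p+ | q), averaged
```

The scale τ sharpens the softmax: with τ = 20, small cosine-similarity differences already translate into large score gaps between the positive and the negatives.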
DistilRoBERTa produces representations with 768 dimensions. We also experiment with lower-dimensional representations: there, we add a linear projection layer on top of the mean pooling operation to down-project the representation to either 128 or 256 dimensions. Dense retrieval is performed using cosine similarity with exact search.
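The projection head can be sketched as follows (NumPy; W and b are learned parameters in the real model, shown here with random, untrained values):

```python
import numpy as np

def pooled_projection(token_embeddings, W, b):
    """Mean-pool token embeddings of shape (seq_len, 768) and apply a
    linear down-projection. W: (768, out_dim), b: (out_dim,)."""
    pooled = token_embeddings.mean(axis=0)   # fixed-size 768-dim vector
    return pooled @ W + b                    # down-projected representation

# Example: project a 32-token passage down to 128 dimensions.
rng = np.random.default_rng(0)
W, b = rng.standard_normal((768, 128)) * 0.02, np.zeros(128)
vec = pooled_projection(rng.standard_normal((32, 768)), W, b)
```

The projection is trained end-to-end with the encoder, so the lower-dimensional space is optimized for the retrieval objective rather than being a post-hoc compression.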

Experiments
First, we study the impact of increasing index sizes with real text passages. Then, we study the performance when random noise is added.

Increasing Index Size
In the first experiment, we start with an index that contains only the 7433 relevant passages for the 6980 queries. Then, we step-wise add randomly selected passages from the MS MARCO corpus to the index until all 8.8 million passages are indexed. Table 1 shows the MRR@10 performance of the different systems. Increasing the index naturally decreases the performance of all systems, as retrieving the correct passages from a larger index is more challenging. The dense approach trained without hard negatives clearly outperforms BM25 for indexes with 10k to 1M entries, but with all 8.8 million passages it performs worse than BM25. Table 2 shows the relative error rate in comparison to BM25 retrieval. For small index sizes, we observe that dense approaches drastically reduce the error rate compared to BM25 retrieval. With increasing index sizes, the gap closes.
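The qualitative effect can be reproduced in a toy setting with purely random distractor vectors (a simulation sketch under the uniformity assumption of the theory section, not the actual experiment; the function name is ours):

```python
import numpy as np

def rank_of_relevant(k, n_distractors, cos_qr=0.8, seed=42):
    """Rank of a relevant document with cossim(q, d_r) = cos_qr among
    n_distractors random unit vectors in k dimensions (rank 1 = best)."""
    rng = np.random.default_rng(seed)
    q = np.zeros(k)
    q[0] = 1.0  # w.l.o.g., by rotational symmetry
    d = rng.standard_normal((n_distractors, k))
    d /= np.linalg.norm(d, axis=1, keepdims=True)  # uniform on the sphere
    return int(np.sum(d @ q > cos_qr)) + 1         # distractors scoring higher
```

With few dimensions, the relevant document is quickly pushed down the ranking as the index grows, while high-dimensional representations stay robust for the same index size.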

Index with Random Noise
MS MARCO is sparsely labeled, i.e., there is usually only a single passage labeled as relevant even though multiple passages would be considered relevant by humans (Craswell et al., 2020). To rule out that the drop in performance is due to the retrieval of relevant but unlabeled passages, we perform an experiment where we add random, irrelevant noise to the index: the index consists only of the relevant passages and a large fraction of irrelevant, randomly generated strings. We also evaluate the popular DPR system by Karpukhin et al. (2020), a BERT-based dense retriever trained on the Natural Questions (NQ) dataset (Kwiatkowski et al., 2019). We use the NQ dev set, consisting of 1772 questions from Google search logs. DPR encodes a passage as Title [SEP] Paragraph. We create a random string for the paragraph and combine it with 1) a randomly generated string as title, 2) a title selected randomly from the over 6 million real Wikipedia article titles, or 3) a title selected randomly from the 1772 article titles found in the NQ dev set.
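The noise passages can be generated as follows (a sketch; the parameters follow the description in the appendix, and the function name is ours):

```python
import random
import string

def random_passage(rng: random.Random, min_len: int = 20, max_len: int = 150) -> str:
    """A random 'passage': lowercase letters and spaces, with a random
    length between min_len and max_len characters."""
    length = rng.randint(min_len, max_len)
    alphabet = string.ascii_lowercase + " "
    return "".join(rng.choice(alphabet) for _ in range(length))
```

Such strings share no lexical overlap with real queries, which is why they are essentially invisible to BM25 but can still collide with queries in a dense embedding space.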
We count for how many queries a random string is ranked higher than the relevant passage. The results are shown in Table 3. We observe that BM25 does not rank any randomly generated passage higher than the relevant passage for the MS MARCO dataset. The chance that a random passage contains words matching the query is small.
For the dense retrieval models, we observe for quite a large number of queries that a random string passage is ranked higher than the relevant passage. As proven in the Theory section, the error increases with larger index sizes and fewer dimensions. For DPR, we observe an extreme dependency on the title: with 100 million index entries consisting of a real Wikipedia article title and a random paragraph, such a noise entry is retrieved at the top position for about 12.08% of all questions.
The error numbers far exceed the estimation from equation (1), confirming that the representations are not uniformly distributed over the complete vector space and are concentrated in a small space. In the appendix (Figure 1), we plot the representations for the queries, the relevant passages, and the random strings.

Conclusion
We have proven and shown empirically that the probability of false positives in dense information retrieval depends on the index size and on the dimensionality of the used representations. These approaches can even retrieve completely irrelevant, randomly generated passages with high probability. It is important to understand the limitations of dense retrieval: 1) Dense approaches work better for smaller, clean indexes. With increasing index size, the difference to sparse approaches can decrease.
2) Evaluation results with smaller indexes cannot be transferred to larger index sizes. A system that is state-of-the-art for an index of 1 million documents might perform badly on larger indices.
3) The false positive rate increases with fewer dimensions.
4) The empirically found error rates far exceeded the mathematical lower-bound error rates, indicating that only a small fraction of the available vector space is effectively used.
A Plot of Random Noise Index

Figure 1 shows a two-dimensional plot of the 6980 development queries in the MS MARCO passage dataset, together with the 7433 passages that are marked as relevant and 7433 representations for randomly generated strings (using lowercase characters and space, with a random length between 20 and 150 characters). The representations for the random strings are concentrated, but we still observe a significant overlap with the region for queries and relevant documents. This explains why random strings are retrieved for certain queries (Table 3). We use the dense model that was trained with hard negatives with 768 dimensions. UMAP (McInnes et al., 2018) is used for dimensionality reduction to 2 dimensions.