Simple and Effective Unsupervised Redundancy Elimination to Compress Dense Vectors for Passage Retrieval

Xueguang Ma, Minghan Li, Kai Sun, Ji Xin, Jimmy Lin


Abstract
Recent work has shown that dense passage retrieval techniques achieve better ranking accuracy in open-domain question answering compared to sparse retrieval techniques such as BM25, but at the cost of large space and memory requirements. In this paper, we analyze the redundancy present in encoded dense vectors and show that the default dimension of 768 is unnecessarily large. To improve space efficiency, we propose a simple unsupervised compression pipeline that consists of principal component analysis (PCA), product quantization, and hybrid search. We further investigate other supervised baselines and find surprisingly that unsupervised PCA outperforms them in some settings. We perform extensive experiments on five question answering datasets and demonstrate that our best pipeline achieves good accuracy–space trade-offs, for example, 48× compression with less than 3% drop in top-100 retrieval accuracy on average or 96× compression with less than 4% drop. Code and data are available at http://pyserini.io/.
Anthology ID:
2021.emnlp-main.227
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
2854–2859
Language:
URL:
https://aclanthology.org/2021.emnlp-main.227
DOI:
10.18653/v1/2021.emnlp-main.227
Bibkey:
Cite (ACL):
Xueguang Ma, Minghan Li, Kai Sun, Ji Xin, and Jimmy Lin. 2021. Simple and Effective Unsupervised Redundancy Elimination to Compress Dense Vectors for Passage Retrieval. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2854–2859, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Simple and Effective Unsupervised Redundancy Elimination to Compress Dense Vectors for Passage Retrieval (Ma et al., EMNLP 2021)
Copy Citation:
PDF:
https://aclanthology.org/2021.emnlp-main.227.pdf
Video:
 https://aclanthology.org/2021.emnlp-main.227.mp4