Precise Zero-Shot Dense Retrieval without Relevance Labels

Luyu Gao; Xueguang Ma; Jimmy Lin; Jamie Callan

doi:10.18653/v1/2023.acl-long.99

Abstract

While dense retrieval has been shown to be effective and efficient across tasks and languages, it remains difficult to create effective fully zero-shot dense retrieval systems when no relevance labels are available. In this paper, we recognize the difficulty of zero-shot learning and encoding relevance. Instead, we propose to pivot through Hypothetical Document Embeddings (HyDE). Given a query, HyDE first zero-shot prompts an instruction-following language model (e.g., InstructGPT) to generate a hypothetical document. The document captures relevance patterns but is “fake” and may contain hallucinations. Then, an unsupervised contrastively learned encoder (e.g., Contriever) encodes the document into an embedding vector. This vector identifies a neighborhood in the corpus embedding space, from which similar real documents are retrieved based on vector similarity. This second step grounds the generated document to the actual corpus, with the encoder’s dense bottleneck filtering out the hallucinations. Our experiments show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever and shows strong performance comparable to fine-tuned retrievers across various tasks (e.g., web search, QA, fact verification) and in non-English languages (e.g., sw, ko, ja, bn).
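The two steps described in the abstract (generate a hypothetical document, then embed it and retrieve real neighbors) can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: `fake_generate` is a hypothetical stub standing in for an instruction-following language model, and `toy_encode` is a term-count vector standing in for a dense encoder such as Contriever. The function names, the corpus, and the query are all assumptions made for the example.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]


def toy_encode(text: str) -> Vector:
    """Stand-in encoder (assumption): term counts, not a learned dense model."""
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fake_generate(query: str) -> str:
    """Hypothetical stub for the language model; returns a canned 'fake' document."""
    return "The encoder is trained with contrastive learning on unlabeled text."


def hyde_search(
    query: str,
    generate: Callable[[str], str],
    encode: Callable[[str], Vector],
    corpus: List[str],
    k: int = 2,
) -> List[Tuple[float, str]]:
    # Step 1: write a hypothetical document that answers the query.
    hypothetical_doc = generate(query)
    # Step 2: embed the hypothetical document instead of the query itself.
    search_vector = encode(hypothetical_doc)
    # Step 3: rank real corpus documents by similarity to that embedding,
    # so only documents that actually exist in the corpus are returned.
    scored = [(cosine(search_vector, encode(doc)), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


corpus = [
    "Contrastive learning trains an encoder on unlabeled text.",
    "The city council approved a new budget for road repairs.",
    "Swahili is spoken widely in East Africa.",
]

results = hyde_search("How is Contriever trained?", fake_generate, toy_encode, corpus)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

Note that the generated text is used only to form the search vector; everything returned comes from `corpus`. In a real system the stubs would be replaced by an actual language model call and a learned encoder, with an approximate nearest-neighbor index in place of the linear scan.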

Anthology ID: 2023.acl-long.99
Volume: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 1762–1777
URL: https://aclanthology.org/2023.acl-long.99/
DOI: 10.18653/v1/2023.acl-long.99
Cite (ACL): Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. 2023. Precise Zero-Shot Dense Retrieval without Relevance Labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1762–1777, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): Precise Zero-Shot Dense Retrieval without Relevance Labels (Gao et al., ACL 2023)
PDF: https://aclanthology.org/2023.acl-long.99.pdf
Video: https://aclanthology.org/2023.acl-long.99.mp4
