Summarization-Based Document IDs for Generative Retrieval with Language Models

Alan Li, Daniel Cheng, Phillip Keung, Jungo Kasai, Noah A. Smith


Abstract
Generative retrieval (Wang et al., 2022; Tay et al., 2022) is a popular approach for end-to-end document retrieval that directly generates document identifiers given an input query. We introduce summarization-based document IDs, in which each document’s ID is composed of an extractive summary or abstractive keyphrases generated by a language model, rather than an integer ID sequence or bags of n-grams as proposed in past work. We find that abstractive, content-based IDs (ACID) and an ID based on the first 30 tokens are very effective in direct comparisons with previous approaches to ID creation. We show that using ACID improves top-10 and top-20 recall by 15.6% and 14.4% (relative) respectively versus the cluster-based integer ID baseline on the MSMARCO 100k retrieval task, and 9.8% and 9.9% respectively on the Wikipedia-based NQ 100k retrieval task. Our results demonstrate the effectiveness of human-readable, natural-language IDs created through summarization for generative retrieval. We also observed that extractive IDs outperformed abstractive IDs on Wikipedia articles in NQ but not the snippets in MSMARCO, which suggests that document characteristics affect generative retrieval performance.
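As a rough illustration of two ideas in the abstract, the sketch below builds an extractive ID from the first 30 tokens of a document and computes a relative improvement in recall. It is a minimal sketch under stated assumptions, not the authors' implementation: whitespace splitting stands in for whatever tokenizer the paper uses, and the function names (first_k_tokens_id, recall_at_k, relative_improvement) and the toy data are hypothetical.

```python
from typing import List


def first_k_tokens_id(document: str, k: int = 30) -> str:
    """Build an extractive document ID from the first k tokens.

    Assumption: tokens are whitespace-delimited; the tokenizer actually
    used for ID creation is not stated in the abstract.
    """
    return " ".join(document.split()[:k])


def recall_at_k(ranked_ids: List[str], relevant_id: str, k: int) -> float:
    """Return 1.0 if the relevant ID appears among the top k generated IDs, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, expressed as a percentage."""
    return 100.0 * (new - baseline) / baseline


# Toy usage with made-up data (not results from the paper).
doc = "Generative retrieval directly generates document identifiers given an input query"
doc_id = first_k_tokens_id(doc, k=5)
ranked = ["some other document id", doc_id]
hit_at_1 = recall_at_k(ranked, doc_id, k=1)
hit_at_2 = recall_at_k(ranked, doc_id, k=2)
gain = relative_improvement(new=0.75, baseline=0.5)
```

Averaging recall_at_k over all queries gives the top-k recall of a system; the relative figures quoted in the abstract compare two such averages in the way relative_improvement does.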
Anthology ID:
2024.wikinlp-1.18
Volume:
Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Lucie Lucie-Aimée, Angela Fan, Tajuddeen Gwadabe, Isaac Johnson, Fabio Petroni, Daniel van Strien
Venue:
WikiNLP
Publisher:
Association for Computational Linguistics
Pages:
126–135
URL:
https://aclanthology.org/2024.wikinlp-1.18
Cite (ACL):
Alan Li, Daniel Cheng, Phillip Keung, Jungo Kasai, and Noah A. Smith. 2024. Summarization-Based Document IDs for Generative Retrieval with Language Models. In Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia, pages 126–135, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Summarization-Based Document IDs for Generative Retrieval with Language Models (Li et al., WikiNLP 2024)
PDF:
https://aclanthology.org/2024.wikinlp-1.18.pdf