Adam Lelkes


2023

How Does Generative Retrieval Scale to Millions of Passages?
Ronak Pradeep | Kai Hui | Jai Gupta | Adam Lelkes | Honglei Zhuang | Jimmy Lin | Donald Metzler | Vinh Tran
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The emerging paradigm of generative retrieval re-frames the classic information retrieval problem into a sequence-to-sequence modeling task, forgoing external indices and encoding an entire document corpus within a single Transformer. Although many different approaches have been proposed to improve the effectiveness of generative retrieval, they have only been evaluated on document corpora on the order of 100K in size. We conduct the first empirical study of generative retrieval techniques across various corpus scales, ultimately scaling up to the entire MS MARCO passage ranking task with a corpus of 8.8M passages and evaluating model sizes up to 11B parameters. We uncover several findings about scaling generative retrieval to millions of passages; notably, the central importance of using synthetic queries as document representations during indexing, the ineffectiveness of existing proposed architecture modifications when accounting for compute cost, and the limits of naively scaling model parameters with respect to retrieval performance. While we find that generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge. We believe these findings will be valuable for the community to clarify the current state of generative retrieval, highlight the unique challenges, and inspire new research directions.

SDOH-NLI: a Dataset for Inferring Social Determinants of Health from Clinical Notes
Adam Lelkes | Eric Loreaux | Tal Schuster | Ming-Jun Chen | Alvin Rajkomar
Findings of the Association for Computational Linguistics: EMNLP 2023

Social and behavioral determinants of health (SDOH) play a significant role in shaping health outcomes, and extracting these determinants from clinical notes is a first step to help healthcare providers systematically identify opportunities to provide appropriate care and address disparities. Progress on using NLP methods for this task has been hindered by the lack of high-quality publicly available labeled data, largely due to the privacy and regulatory constraints on the use of real patients’ information. This paper introduces a new dataset, SDOH-NLI, that is based on publicly available notes and that we release publicly. We formulate SDOH extraction as a natural language inference task, and provide binary textual entailment labels obtained from human raters for a cross product of a set of social history snippets as premises and SDOH factors as hypotheses. Our dataset differs from standard NLI benchmarks in that our premises and hypotheses are obtained independently. We evaluate both “off-the-shelf” entailment models and models fine-tuned on our data, and highlight the ways in which our dataset appears more challenging than commonly used NLI datasets.

2021

AgreeSum: Agreement-Oriented Multi-Document Summarization
Richard Yuanzhe Pang | Adam Lelkes | Vinh Tran | Cong Yu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021