Patrick Lewis


2022

pdf bib
Domain-matched Pre-training Tasks for Dense Retrieval
Barlas Oguz | Kushal Lakhotia | Anchit Gupta | Patrick Lewis | Vladimir Karpukhin | Aleksandra Piktus | Xilun Chen | Sebastian Riedel | Scott Yih | Sonal Gupta | Yashar Mehdad
Findings of the Association for Computational Linguistics: NAACL 2022

Pre-training on larger datasets with ever increasing model size isnow a proven recipe for increased performance across almost all NLP tasks.A notable exception is information retrieval, where additional pre-traininghas so far failed to produce convincing results. We show that, with theright pre-training setup, this barrier can be overcome. We demonstrate thisby pre-training large bi-encoder models on 1) a recently released set of 65 millionsynthetically generated questions, and 2) 200 million post-comment pairs from a preexisting dataset of Reddit conversations made available by pushshift.io.We evaluate on a set of information retrieval and dialogue retrieval benchmarks, showing substantial improvements over supervised baselines.

pdf bib
Challenges in Generalization in Open Domain Question Answering
Linqing Liu | Patrick Lewis | Sebastian Riedel | Pontus Stenetorp
Findings of the Association for Computational Linguistics: NAACL 2022

Recent work on Open Domain Question Answering has shown that there is a large discrepancy in model performance between novel test questions and those that largely overlap with training questions. However, it is unclear which aspects of novel questions make them challenging. Drawing upon studies on systematic generalization, we introduce and annotate questions according to three categories that measure different levels and kinds of generalization: training set overlap, compositional generalization (comp-gen), and novel-entity generalization (novel-entity). When evaluating six popular parametric and non-parametric models, we find that for the established Natural Questions and TriviaQA datasets, even the strongest model performance for comp-gen/novel-entity is 13.1/5.4% and 9.6/1.5% lower compared to that for the full test set – indicating the challenge posed by these types of questions. Furthermore, we show that whilst non-parametric models can handle questions containing novel entities relatively well, they struggle with those requiring compositional generalization. Lastly, we find that key question difficulty factors are: cascading errors from the retrieval component, frequency of question pattern, and frequency of the entity.

pdf bib
Proceedings of the 1st Workshop on Semiparametric Methods in NLP: Decoupling Logic from Knowledge
Rajarshi Das | Patrick Lewis | Sewon Min | June Thai | Manzil Zaheer
Proceedings of the 1st Workshop on Semiparametric Methods in NLP: Decoupling Logic from Knowledge

pdf bib
Boosted Dense Retriever
Patrick Lewis | Barlas Oguz | Wenhan Xiong | Fabio Petroni | Scott Yih | Sebastian Riedel
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We propose DrBoost, a dense retrieval ensemble inspired by boosting. DrBoost is trained in stages: each component model is learned sequentially and specialized by focusing only on retrieval mistakes made by the current ensemble. The final representation is the concatenation of the output vectors of all the component models, making it a drop-in replacement for standard dense retrievers at test time. DrBoost enjoys several advantages compared to standard dense retrieval models. It produces representations which are 4x more compact, while delivering comparable retrieval results. It also performs surprisingly well under approximate search with coarse quantization, reducing latency and bandwidth needs by another 4x. In practice, this can make the difference between serving indices from disk versus from memory, paving the way for much cheaper deployments.

2021

pdf bib
Proceedings of the 3rd Workshop on Machine Reading for Question Answering
Adam Fisch | Alon Talmor | Danqi Chen | Eunsol Choi | Minjoon Seo | Patrick Lewis | Robin Jia | Sewon Min
Proceedings of the 3rd Workshop on Machine Reading for Question Answering

pdf bib
PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them
Patrick Lewis | Yuxiang Wu | Linqing Liu | Pasquale Minervini | Heinrich Küttler | Aleksandra Piktus | Pontus Stenetorp | Sebastian Riedel
Transactions of the Association for Computational Linguistics, Volume 9

Abstract Open-domain Question Answering models that directly leverage question-answer (QA) pairs, such as closed-book QA (CBQA) models and QA-pair retrievers, show promise in terms of speed and memory compared with conventional models which retrieve and read from text corpora. QA-pair retrievers also offer interpretable answers, a high degree of control, and are trivial to update at test time with new knowledge. However, these models fall short of the accuracy of retrieve-and-read systems, as substantially less knowledge is covered by the available QA-pairs relative to text corpora like Wikipedia. To facilitate improved QA-pair models, we introduce Probably Asked Questions (PAQ), a very large resource of 65M automatically generated QA-pairs. We introduce a new QA-pair retriever, RePAQ, to complement PAQ. We find that PAQ preempts and caches test questions, enabling RePAQ to match the accuracy of recent retrieve-and-read models, whilst being significantly faster. Using PAQ, we train CBQA models which outperform comparable baselines by 5%, but trail RePAQ by over 15%, indicating the effectiveness of explicit retrieval. RePAQ can be configured for size (under 500MB) or speed (over 1K questions per second) while retaining high accuracy. Lastly, we demonstrate RePAQ’s strength at selective QA, abstaining from answering when it is likely to be incorrect. This enables RePAQ to “back-off” to a more expensive state-of-the-art model, leading to a combined system which is both more accurate and 2x faster than the state-of-the-art model alone.

pdf bib
Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets
Patrick Lewis | Pontus Stenetorp | Sebastian Riedel
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Ideally Open-Domain Question Answering models should exhibit a number of competencies, ranging from simply memorizing questions seen at training time, to answering novel question formulations with answers seen during training, to generalizing to completely novel questions with novel answers. However, single aggregated test set scores do not show the full picture of what capabilities models truly have. In this work, we perform a detailed study of the test sets of three popular open-domain benchmark datasets with respect to these competencies. We find that 30% of test-set questions have a near-duplicate paraphrase in their corresponding train sets. In addition, we find that 60-70% of answers in the test sets are also present in the train sets. Using these findings, we evaluate a variety of popular open-domain models to obtain greater insight into what extent they can generalize, and what drives their overall performance. We find that all models perform substantially worse on questions that cannot be memorized from train sets, with a mean absolute performance difference of 61% between repeated and non-repeated data. Finally we show that simple nearest-neighbor models outperform a BART closed-book QA model, further highlighting the role that train set memorization plays in these benchmarks

pdf bib
KILT: a Benchmark for Knowledge Intensive Language Tasks
Fabio Petroni | Aleksandra Piktus | Angela Fan | Patrick Lewis | Majid Yazdani | Nicola De Cao | James Thorne | Yacine Jernite | Vladimir Karpukhin | Jean Maillard | Vassilis Plachouras | Tim Rocktäschel | Sebastian Riedel
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Challenging problems such as open-domain question answering, fact checking, slot filling and entity linking require access to large, external knowledge sources. While some models do well on individual tasks, developing general models is difficult as each task might require computationally expensive indexing of custom knowledge sources, in addition to dedicated infrastructure. To catalyze research on models that condition on specific information in large textual resources, we present a benchmark for knowledge-intensive language tasks (KILT). All tasks in KILT are grounded in the same snapshot of Wikipedia, reducing engineering turnaround through the re-use of components, as well as accelerating research into task-agnostic memory architectures. We test both task-specific and general baselines, evaluating downstream performance in addition to the ability of the models to provide provenance. We find that a shared dense vector index coupled with a seq2seq model is a strong baseline, outperforming more tailor-made approaches for fact checking, open-domain question answering and dialogue, and yielding competitive results on entity linking and slot filling, by generating disambiguated text. KILT data and code are available at https://github.com/facebookresearch/KILT.

2020

pdf bib
MLQA: Evaluating Cross-lingual Extractive Question Answering
Patrick Lewis | Barlas Oguz | Ruty Rinott | Sebastian Riedel | Holger Schwenk
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Question answering (QA) models have shown rapid progress enabled by the availability of large, high-quality benchmark datasets. Such annotated datasets are difficult and costly to collect, and rarely exist in languages other than English, making building QA systems that work well in other languages challenging. In order to develop such systems, it is crucial to invest in high quality multilingual evaluation benchmarks to measure progress. We present MLQA, a multi-way aligned extractive QA evaluation benchmark intended to spur research in this area. MLQA contains QA instances in 7 languages, English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA has over 12K instances in English and 5K in each other language, with each instance parallel between 4 languages on average. We evaluate state-of-the-art cross-lingual models and machine-translation-based baselines on MLQA. In all cases, transfer results are shown to be significantly behind training-language performance.

pdf bib
Proceedings of the 5th Workshop on Representation Learning for NLP
Spandana Gella | Johannes Welbl | Marek Rei | Fabio Petroni | Patrick Lewis | Emma Strubell | Minjoon Seo | Hannaneh Hajishirzi
Proceedings of the 5th Workshop on Representation Learning for NLP

pdf bib
Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art
Patrick Lewis | Myle Ott | Jingfei Du | Veselin Stoyanov
Proceedings of the 3rd Clinical Natural Language Processing Workshop

A large array of pretrained models are available to the biomedical NLP (BioNLP) community. Finding the best model for a particular task can be difficult and time-consuming. For many applications in the biomedical and clinical domains, it is crucial that models can be built quickly and are highly accurate. We present a large-scale study across 18 established biomedical and clinical NLP tasks to determine which of several popular open-source biomedical and clinical NLP models work well in different settings. Furthermore, we apply recent advances in pretraining to train new biomedical language models, and carefully investigate the effect of various design choices on downstream performance. Our best models perform well in all of our benchmarks, and set new State-of-the-Art in 9 tasks. We release these models in the hope that they can help the community to speed up and increase the accuracy of BioNLP and text mining applications.

pdf bib
Dense Passage Retrieval for Open-Domain Question Answering
Vladimir Karpukhin | Barlas Oguz | Sewon Min | Patrick Lewis | Ledell Wu | Sergey Edunov | Danqi Chen | Wen-tau Yih
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system greatly by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA benchmarks.

pdf bib
Unsupervised Question Decomposition for Question Answering
Ethan Perez | Patrick Lewis | Wen-tau Yih | Kyunghyun Cho | Douwe Kiela
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We aim to improve question answering (QA) by decomposing hard questions into simpler sub-questions that existing QA systems are capable of answering. Since labeling questions with decompositions is cumbersome, we take an unsupervised approach to produce sub-questions, also enabling us to leverage millions of questions from the internet. Specifically, we propose an algorithm for One-to-N Unsupervised Sequence transduction (ONUS) that learns to map one hard, multi-hop question to many simpler, single-hop sub-questions. We answer sub-questions with an off-the-shelf QA model and give the resulting answers to a recomposition model that combines them into a final answer. We show large QA improvements on HotpotQA over a strong baseline on the original, out-of-domain, and multi-hop dev sets. ONUS automatically learns to decompose different kinds of questions, while matching the utility of supervised and heuristic decomposition methods for QA and exceeding those methods in fluency. Qualitatively, we find that using sub-questions is promising for shedding light on why a QA system makes a prediction.

2019

pdf bib
Language Models as Knowledge Bases?
Fabio Petroni | Tim Rocktäschel | Sebastian Riedel | Patrick Lewis | Anton Bakhtin | Yuxiang Wu | Alexander Miller
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Recent progress in pretraining language models on large textual corpora led to a surge of improvements for downstream NLP tasks. Whilst learning linguistic knowledge, these models may also be storing relational knowledge present in the training data, and may be able to answer queries structured as “fill-in-the-blank” cloze statements. Language models have many advantages over structured knowledge bases: they require no schema engineering, allow practitioners to query about an open class of relations, are easy to extend to more data, and require no human supervision to train. We present an in-depth analysis of the relational knowledge already present (without fine-tuning) in a wide range of state-of-the-art pretrained language models. We find that (i) without fine-tuning, BERT contains relational knowledge competitive with traditional NLP methods that have some access to oracle knowledge, (ii) BERT also does remarkably well on open-domain question answering against a supervised baseline, and (iii) certain types of factual knowledge are learned much more readily than others by standard language model pretraining approaches. The surprisingly strong ability of these models to recall factual knowledge without any fine-tuning demonstrates their potential as unsupervised open-domain QA systems. The code to reproduce our analysis is available at https://github.com/facebookresearch/LAMA.

pdf bib
Unsupervised Question Answering by Cloze Translation
Patrick Lewis | Ludovic Denoyer | Sebastian Riedel
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Obtaining training data for Question Answering (QA) is time-consuming and resource-intensive, and existing QA datasets are only available for limited domains and languages. In this work, we explore to what extent high quality training data is actually required for Extractive QA, and investigate the possibility of unsupervised Extractive QA. We approach this problem by first learning to generate context, question and answer triples in an unsupervised manner, which we then use to synthesize Extractive QA training data automatically. To generate such triples, we first sample random context paragraphs from a large corpus of documents and then random noun phrases or Named Entity mentions from these paragraphs as answers. Next we convert answers in context to “fill-in-the-blank” cloze questions and finally translate them into natural questions. We propose and compare various unsupervised ways to perform cloze-to-natural question translation, including training an unsupervised NMT model using non-aligned corpora of natural questions and cloze questions as well as a rule-based approach. We find that modern QA models can learn to answer human questions surprisingly well using only synthetic training data. We demonstrate that, without using the SQuAD training data at all, our approach achieves 56.4 F1 on SQuAD v1 (64.5 F1 when the answer is a Named Entity mention), outperforming early supervised models.

2018

pdf bib
Interpretation of Natural Language Rules in Conversational Machine Reading
Marzieh Saeidi | Max Bartolo | Patrick Lewis | Sameer Singh | Tim Rocktäschel | Mike Sheldon | Guillaume Bouchard | Sebastian Riedel
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Most work in machine reading focuses on question answering problems where the answer is directly expressed in the text to read. However, many real-world question answering problems require the reading of text not because it contains the literal answer, but because it contains a recipe to derive an answer together with the reader’s background knowledge. One example is the task of interpreting regulations to answer “Can I...?” or “Do I have to...?” questions such as “I am working in Canada. Do I have to carry on paying UK National Insurance?” after reading a UK government website about this topic. This task requires both the interpretation of rules and the application of background knowledge. It is further complicated due to the fact that, in practice, most questions are underspecified, and a human assistant will regularly have to ask clarification questions such as “How long have you been working abroad?” when the answer cannot be directly derived from the question and text. In this paper, we formalise this task and develop a crowd-sourcing strategy to collect 37k task instances based on real-world rules and crowd-generated questions and scenarios. We analyse the challenges of this task and assess its difficulty by evaluating the performance of rule-based and machine-learning baselines. We observe promising results when no background knowledge is necessary, and substantial room for improvement whenever background knowledge is needed.