Jamie Callan


2022

Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval
Luyu Gao | Jamie Callan
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent research demonstrates the effectiveness of using fine-tuned language models (LMs) for dense retrieval. However, dense retrievers are hard to train, typically requiring heavily engineered fine-tuning pipelines to realize their full potential. In this paper, we identify and address two underlying problems of dense retrievers: i) fragility to training data noise and ii) requiring large batches to robustly learn the embedding space. We use the recently proposed Condenser pre-training architecture, which learns to condense information into the dense vector through LM pre-training. On top of it, we propose coCondenser, which adds an unsupervised corpus-level contrastive loss to warm up the passage embedding space. Experiments on the MS-MARCO, Natural Questions, and TriviaQA datasets show that coCondenser removes the need for heavy data engineering such as augmentation, synthesis, or filtering, and the need for large batch training. It shows comparable performance to RocketQA, a state-of-the-art, heavily engineered system, using simple small batch fine-tuning.

2021

Condenser: a Pre-training Architecture for Dense Retrieval
Luyu Gao | Jamie Callan
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Pre-trained Transformer language models (LMs) have become go-to text representation encoders. Prior research fine-tunes deep LMs to encode text sequences such as sentences and passages into single dense vector representations for efficient text comparison and retrieval. However, dense encoders require a lot of data and sophisticated techniques to train effectively, and they suffer in low data situations. This paper finds that a key reason is that standard LMs’ internal attention structure is not ready-to-use for dense encoders, which need to aggregate text information into the dense representation. We propose to pre-train towards dense encoders with a novel Transformer architecture, Condenser, where LM prediction CONditions on DENSE Representation. Our experiments show Condenser improves over standard LMs by large margins on various text retrieval and similarity tasks.

Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup
Luyu Gao | Yunyi Zhang | Jiawei Han | Jamie Callan
Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)

Contrastive learning has been applied successfully to learn vector representations of text. Previous research demonstrated that learning high-quality representations benefits from batch-wise contrastive loss with a large number of negatives. In practice, the technique of in-batch negatives is used, where for each example in a batch, other batch examples’ positives are taken as its negatives, avoiding encoding extra negatives. This, however, still conditions each example’s loss on all batch examples and requires fitting the entire large batch into GPU memory. This paper introduces a gradient caching technique that decouples backpropagation between contrastive loss and the encoder, removing the encoder backward pass’s data dependency along the batch dimension. As a result, gradients can be computed for one subset of the batch at a time, leading to almost constant memory usage.
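
The gradient caching idea described in this abstract can be illustrated with a toy sketch. It is not the authors' implementation: it assumes a shared linear encoder (a weight matrix W) so that gradients can be written by hand in plain Python, and every name in it (encode, rep_gradients, cached_gradient, the toy data) is hypothetical. The loss is the in-batch negative cross-entropy; its gradients with respect to the representations are computed once and cached, and the encoder gradient is then accumulated one chunk of the batch at a time.

```python
import math

def encode(W, x):
    """Toy linear encoder (an assumption for this sketch): returns W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_loss(q_reps, p_reps):
    """Mean cross-entropy: passage i is the positive for query i,
    every other passage in the batch serves as a negative."""
    n = len(q_reps)
    total = 0.0
    for i in range(n):
        scores = [dot(q_reps[i], p) for p in p_reps]
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]
    return total / n

def rep_gradients(q_reps, p_reps):
    """Gradient of the loss w.r.t. each representation (the cache)."""
    n, d = len(q_reps), len(q_reps[0])
    gq = [[0.0] * d for _ in range(n)]
    gp = [[0.0] * d for _ in range(n)]
    for i in range(n):
        scores = [dot(q_reps[i], p) for p in p_reps]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        for j in range(n):
            # d loss / d score[i][j] = (softmax_ij - [i == j]) / n
            g = (exps[j] / z - (1.0 if i == j else 0.0)) / n
            for k in range(d):
                gq[i][k] += g * p_reps[j][k]
                gp[j][k] += g * q_reps[i][k]
    return gq, gp

def cached_gradient(W, queries, passages, chunk_size):
    # Step 1: encode the full batch; only the representations are kept.
    q_reps = [encode(W, x) for x in queries]
    p_reps = [encode(W, y) for y in passages]
    # Step 2: compute and cache the representation gradients.
    gq, gp = rep_gradients(q_reps, p_reps)
    # Step 3: accumulate the encoder gradient chunk by chunk. For a linear
    # encoder, d rep[r] / d W[r][c] = x[c], so each example adds outer(g, x).
    inputs, grads = queries + passages, gq + gp
    grad_W = [[0.0] * len(W[0]) for _ in W]
    for start in range(0, len(inputs), chunk_size):
        chunk = zip(inputs[start:start + chunk_size],
                    grads[start:start + chunk_size])
        for x, g in chunk:
            for r in range(len(W)):
                for c in range(len(W[0])):
                    grad_W[r][c] += g[r] * x[c]
    return grad_W

# Toy data, made up for illustration.
W = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]
queries = [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0], [-1.0, 0.3, 0.7], [0.2, -0.4, 0.9]]
passages = [[0.9, 0.1, 1.8], [0.4, 1.2, -0.8], [-0.7, 0.5, 0.6], [0.1, -0.3, 1.1]]
grad_small_chunks = cached_gradient(W, queries, passages, chunk_size=2)
grad_one_chunk = cached_gradient(W, queries, passages, chunk_size=8)
```

In this sketch the chunk size changes only how many examples are processed together in step 3, not the resulting gradient. With a deep encoder, step 3 would instead re-run the encoder on each chunk with gradient tracking enabled and backpropagate the cached representation gradients through it.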

COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List
Luyu Gao | Zhuyun Dai | Jamie Callan
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Classical information retrieval systems such as BM25 rely on exact lexical match and can carry out search efficiently with an inverted list index. Recent neural IR models shift towards soft matching of all query and document terms, but they lose the computational efficiency of exact match systems. This paper presents COIL, a contextualized exact match retrieval architecture, where scoring is based on the contextualized representations of overlapping query and document tokens. The new architecture stores contextualized token representations in inverted lists, bringing together the efficiency of exact match and the representation power of deep language models. Our experimental results show COIL outperforms classical lexical retrievers and state-of-the-art deep LM retrievers with similar or smaller latency.
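
A minimal sketch of scoring over inverted lists of contextualized token vectors, as this abstract describes, might look as follows. It is illustrative only. The aggregation is an assumption the abstract does not spell out: for each query token, take the best dot product among occurrences of the same token in a document, then sum over query tokens. The function names and the two-dimensional toy vectors are hypothetical; a real system would obtain the vectors from a deep LM.

```python
from collections import defaultdict

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_index(docs):
    """docs: {doc_id: [(token, vector), ...]}
    Returns an inverted index {token: [(doc_id, vector), ...]}."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for token, vec in tokens:
            index[token].append((doc_id, vec))
    return index

def score(query, index):
    """query: [(token, vector), ...]. Only documents containing a query
    token are touched, as with a classical inverted index."""
    totals = defaultdict(float)
    for token, q_vec in query:
        best = {}  # best match per document for this query token
        for doc_id, d_vec in index.get(token, []):
            s = dot(q_vec, d_vec)
            if doc_id not in best or s > best[doc_id]:
                best[doc_id] = s
        for doc_id, s in best.items():
            totals[doc_id] += s
    return dict(totals)

# Toy vectors, made up for illustration.
docs = {
    "d1": [("bank", [1.0, 0.0]), ("river", [0.0, 1.0]), ("bank", [0.5, 0.5])],
    "d2": [("bank", [0.0, 1.0]), ("money", [1.0, 0.0])],
}
index = build_index(docs)
scores = score([("bank", [1.0, 0.0]), ("river", [0.0, 2.0])], index)
```

Both toy documents contain "bank", but only the occurrences whose vectors point in the same direction as the query's "bank" vector contribute a high score, which is where the contextualization enters.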

2020

Making Information Seeking Easier: An Improved Pipeline for Conversational Search
Vaibhav Kumar | Jamie Callan
Findings of the Association for Computational Linguistics: EMNLP 2020

This paper presents a highly effective pipeline for passage retrieval in a conversational search setting. The pipeline comprises two components: Conversational Term Selection (CTS) and Multi-View Reranking (MVR). CTS is responsible for performing the first stage of passage retrieval. Given an input question, it uses a BERT-based classifier (trained with weak supervision) to de-contextualize the input by selecting relevant terms from the dialog history. Using the question and the selected terms, it issues a query to a search engine to perform the first stage of passage retrieval. MVR, in turn, is responsible for contextualized passage reranking. It first constructs multiple views of the information need embedded within an input question. The views are based on the dialog history and the top documents obtained in the first stage of retrieval. It then uses each view to rerank passages using BERT (fine-tuned for passage ranking). Finally, MVR performs a fusion over the rankings produced by the individual views. Experiments show that the above combination improves first-stage retrieval as well as the overall accuracy in a reranking pipeline. On the key metric of NDCG@3, the proposed combination achieves a relative performance improvement of 14.8% over the state-of-the-art baseline and is also able to surpass the Oracle.
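
The final fusion step can be illustrated with a small sketch. The abstract does not say which fusion method MVR uses, so this example swaps in reciprocal rank fusion, a common generic choice, purely as an assumption. The function name, the constant k=60, and the passage ids are all hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """rankings: one ranked list of passage ids per view, best first.
    Each passage receives 1 / (k + rank) from every view that ranks it;
    passages are returned by descending fused score (ties broken by id)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, pid in enumerate(ranking, start=1):
            scores[pid] += 1.0 / (k + rank)
    return sorted(scores, key=lambda pid: (-scores[pid], pid))

# Three hypothetical views that disagree on the top passage.
views = [
    ["p1", "p2", "p3"],
    ["p2", "p1", "p3"],
    ["p2", "p3", "p1"],
]
fused = reciprocal_rank_fusion(views)
```

Here "p2" is ranked first by two of the three views, so it ends up first in the fused list.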

Modularized Transfomer-based Ranking Framework
Luyu Gao | Zhuyun Dai | Jamie Callan
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Recent innovations in Transformer-based ranking models have advanced the state of the art in information retrieval. However, these Transformers are computationally expensive, and their opaque hidden states make it hard to understand the ranking process. In this work, we modularize the Transformer ranker into separate modules for text representation and interaction. We show how this design enables substantially faster ranking using offline pre-computed representations and lightweight online interactions. The modular design is also easier to interpret and sheds light on the ranking process in Transformer rankers.

2012

Collectively Representing Semi-Structured Data from the Web
Bhavana Dalvi | William Cohen | Jamie Callan
Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX)

2010

Proceedings of the NAACL HLT 2010 Workshop on Semantic Search
Donghui Feng | Jamie Callan | Eduard Hovy | Marius Pasca
Proceedings of the NAACL HLT 2010 Workshop on Semantic Search

2009

A Metric-based Framework for Automatic Taxonomy Induction
Hui Yang | Jamie Callan
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2008

Dictionary Definitions based Homograph Identification using a Generative Hierarchical Model
Anagha Kulkarni | Jamie Callan
Proceedings of ACL-08: HLT, Short Papers

2007

Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts
Michael Heilman | Kevyn Collins-Thompson | Jamie Callan | Maxine Eskenazi
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference

Automatic and Human Scoring of Word Definition Responses
Kevyn Collins-Thompson | Jamie Callan
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference

2005

Combining Multiple Forms of Evidence While Filtering
Yi Zhang | Jamie Callan
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

1996

Chinese Information Extraction and Retrieval
Sean Boisen | Michael Crystal | Erik Peterson | Ralph Weischedel | John Broglio | Jamie Callan | Bruce Croft | Theresa Hand | Thomas Keenan | Mary Ellen Okurowski
TIPSTER TEXT PROGRAM PHASE II: Proceedings of a Workshop held at Vienna, Virginia, May 6-8, 1996