Keith Hall

Also published as: Keith B. Hall


2024

pdf bib
HYRR: Hybrid Infused Reranking for Passage Retrieval
Jing Lu | Keith Hall | Ji Ma | Jianmo Ni
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Existing passage retrieval systems typically adopt a two-stage retrieve-then-rerank pipeline. To obtain an effective reranking model, many prior works have focused on improving the model architectures, such as leveraging powerful pretrained large language models (LLM) and designing better objective functions. However, less attention has been paid to the issue of collecting high-quality training data. In this paper, we propose HYRR, a framework for training robust reranking models. Specifically, we propose a simple but effective approach to select training data using hybrid retrievers. Our experiments show that the rerankers trained with HYRR are robust to different first-stage retrievers. Moreover, evaluations using MS MARCO and BEIR data sets demonstrate our proposed framework effectively generalizes to both supervised and zero-shot retrieval settings.

pdf bib
OpenMSD: Towards Multilingual Scientific Documents Similarity Measurement
Yang Gao | Ji Ma | Ivan Korotkov | Keith Hall | Dana Alon | Donald Metzler
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We develop and evaluate multilingual scientific documents similarity measurement models in this work. Such models can be used to find related papers in different languages, which can help multilingual researchers find and explore papers more efficiently. We propose the first multilingual scientific documents dataset, Open-access Multilingual Scientific Documents (OpenMSD), which has 74M papers in 103 languages and 778M citation pairs. With OpenMSD, we develop multilingual SDSM models by adjusting and extending the state-of-the-art methods designed for English SDSM tasks. We find that: (i)Some highly successful methods in English SDSM yield significantly worse performance in multilingual SDSM. (ii)Our best model, which enriches the non-English papers with English summaries, outperforms strong baselines by 7% (in mean average precision) on multilingual SDSM tasks, without compromising the performance on English SDSM tasks.

2022

pdf bib
Large Dual Encoders Are Generalizable Retrievers
Jianmo Ni | Chen Qu | Jing Lu | Zhuyun Dai | Gustavo Hernandez Abrego | Ji Ma | Vincent Zhao | Yi Luan | Keith Hall | Ming-Wei Chang | Yinfei Yang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

It has been shown that dual encoders trained on one domain often fail to generalize to other domains for retrieval tasks. One widespread belief is that the bottleneck layer of a dual encoder, where the final score is simply a dot-product between a query vector and a passage vector, is too limited compared to models with fine-grained interactions between the query and the passage. In this paper, we challenge this belief by scaling up the size of the dual encoder model while keeping the bottleneck layer as a single dot-product with a fixed size. With multi-stage training, scaling up the model size brings significant improvement on a variety of retrieval tasks, especially for out-of-domain generalization. We further analyze the impact of the bottleneck layer and demonstrate diminishing improvement when scaling up the embedding size. Experimental results show that our dual encoders, Generalizable T5-based dense Retrievers (GTR), outperform previous sparse and dense retrievers on the BEIR dataset significantly. Most surprisingly, our ablation study finds that GTR is very data efficient, as it only needs 10% of MS Marco supervised data to match the out-of-domain performance of using all supervised data.

pdf bib
Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models
Jianmo Ni | Gustavo Hernandez Abrego | Noah Constant | Ji Ma | Keith Hall | Daniel Cer | Yinfei Yang
Findings of the Association for Computational Linguistics: ACL 2022

We provide the first exploration of sentence embeddings from text-to-text transformers (T5) including the effects of scaling up sentence encoders to 11B parameters. Sentence embeddings are broadly useful for language processing tasks. While T5 achieves impressive performance on language tasks, it is unclear how to produce sentence embeddings from encoder-decoder models. We investigate three methods to construct Sentence-T5 (ST5) models: two utilize only the T5 encoder and one using the full T5 encoder-decoder. We establish a new sentence representation transfer benchmark, SentGLUE, which extends the SentEval toolkit to nine tasks from the GLUE benchmark. Our encoder-only models outperform the previous best models on both SentEval and SentGLUE transfer tasks, including semantic textual similarity (STS). Scaling up ST5 from millions to billions of parameters shown to consistently improve performance. Finally, our encoder-decoder method achieves a new state-of-the-art on STS when using sentence embeddings.

2021

pdf bib
Zero-shot Neural Passage Retrieval via Domain-targeted Synthetic Question Generation
Ji Ma | Ivan Korotkov | Yinfei Yang | Keith Hall | Ryan McDonald
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

A major obstacle to the wide-spread adoption of neural retrieval models is that they require large supervised training sets to surpass traditional term-based techniques, which are constructed from raw corpora. In this paper, we propose an approach to zero-shot learning for passage retrieval that uses synthetic question generation to close this gap. The question generation system is trained on general domain data, but is applied to documents in the targeted domain. This allows us to create arbitrarily large, yet noisy, question-passage relevance pairs that are domain specific. Furthermore, when this is coupled with a simple hybrid term-neural model, first-stage retrieval performance can be improved further. Empirically, we show that this is an effective strategy for building neural passage retrieval models in the absence of large training corpora. Depending on the domain, this technique can even approach the accuracy of supervised models.

pdf bib
Proceedings of the Second Workshop on Scholarly Document Processing
Iz Beltagy | Arman Cohan | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Keith Hall | Drahomira Herrmannova | Petr Knoth | Kyle Lo | Philipp Mayr | Robert M. Patton | Michal Shmueli-Scheuer | Anita de Waard | Kuansan Wang | Lucy Lu Wang
Proceedings of the Second Workshop on Scholarly Document Processing

pdf bib
Overview of the Second Workshop on Scholarly Document Processing
Iz Beltagy | Arman Cohan | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Keith Hall | Drahomira Herrmannova | Petr Knoth | Kyle Lo | Philipp Mayr | Robert Patton | Michal Shmueli-Scheuer | Anita de Waard | Kuansan Wang | Lucy Lu Wang
Proceedings of the Second Workshop on Scholarly Document Processing

With the ever-increasing pace of research and high volume of scholarly communication, scholars face a daunting task. Not only must they keep up with the growing literature in their own and related fields, scholars increasingly also need to rebut pseudo-science and disinformation. These needs have motivated an increasing focus on computational methods for enhancing search, summarization, and analysis of scholarly documents. However, the various strands of research on scholarly document processing remain fragmented. To reach out to the broader NLP and AI/ML community, pool distributed efforts in this area, and enable shared access to published research, we held the 2nd Workshop on Scholarly Document Processing (SDP) at NAACL 2021 as a virtual event (https://sdproc.org/2021/). The SDP workshop consisted of a research track, three invited talks, and three Shared Tasks (LongSumm 2021, SCIVER, and 3C). The program was geared towards the application of NLP, information retrieval, and data mining for scholarly documents, with an emphasis on identifying and providing solutions to open challenges.

2020

pdf bib
Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset
Brian Roark | Lawrence Wolf-Sonkin | Christo Kirov | Sabrina J. Mielke | Cibu Johny | Isin Demirsahin | Keith Hall
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text.

2019

pdf bib
Text Genre and Training Data Size in Human-like Parsing
John Hale | Adhiguna Kuncoro | Keith Hall | Chris Dyer | Jonathan Brennan
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Domain-specific training typically makes NLP systems work better. We show that this extends to cognitive modeling as well by relating the states of a neural phrase-structure parser to electrophysiological measures from human participants. These measures were recorded as participants listened to a spoken recitation of the same literary text that was supplied as input to the neural parser. Given more training data, the system derives a better cognitive model — but only when the training examples come from the same textual genre. This finding is consistent with the idea that humans adapt syntactic expectations to particular genres during language comprehension (Kaan and Chun, 2018; Branigan and Pickering, 2017).

2016

pdf bib
Cross-lingual projection for class-based language models
Beat Gfeller | Vlad Schogol | Keith Hall
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2014

pdf bib
Projecting the Knowledge Graph to Syntactic Parsing
Andrea Gesmundo | Keith Hall
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

2013

pdf bib
Russian Stress Prediction using Maximum Entropy Ranking
Keith Hall | Richard Sproat
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
Universal Dependency Annotation for Multilingual Parsing
Ryan McDonald | Joakim Nivre | Yvonne Quirmbach-Brundage | Yoav Goldberg | Dipanjan Das | Kuzman Ganchev | Keith Hall | Slav Petrov | Hao Zhang | Oscar Täckström | Claudia Bedini | Núria Bertomeu Castelló | Jungmee Lee
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Tutorials)
Johan Bos | Keith Hall
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Tutorials)

2012

pdf bib
Using Search-Logs to Improve Query Tagging
Kuzman Ganchev | Keith Hall | Ryan McDonald | Slav Petrov
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2011

pdf bib
Beam-Width Prediction for Efficient Context-Free Parsing
Nathan Bodenstab | Aaron Dunlop | Keith Hall | Brian Roark
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Multi-Source Transfer of Delexicalized Dependency Parsers
Ryan McDonald | Slav Petrov | Keith Hall
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

pdf bib
Training dependency parsers by jointly optimizing multiple objectives
Keith Hall | Ryan McDonald | Jason Katz-Brown | Michael Ringgaard
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

2010

pdf bib
Distributed Training Strategies for the Structured Perceptron
Ryan McDonald | Keith Hall | Gideon Mann
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Learning Dense Models of Query Similarity from User Click Logs
Fabio De Bona | Stefan Riezler | Keith Hall | Massimiliano Ciaramita | Amaç Herdaǧdelen | Maria Holmqvist
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Instance Sense Induction from Attribute Sets
Ricardo Martin-Brualla | Enrique Alfonseca | Marius Pasca | Keith Hall | Enrique Robledo-Arnuncio | Massimiliano Ciaramita
Coling 2010: Posters

2009

pdf bib
A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches
Eneko Agirre | Enrique Alfonseca | Keith Hall | Jana Kravalova | Marius Paşca | Aitor Soroa
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Large-scale Computation of Distributional Similarities for Queries
Enrique Alfonseca | Keith Hall | Silvana Hartmann
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

pdf bib
Reconstructing False Start Errors in Spontaneous Speech Text
Erin Fitzgerald | Keith Hall | Frederick Jelinek
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

pdf bib
Integrating sentence- and word-level error identification for disfluency correction
Erin Fitzgerald | Frederick Jelinek | Keith Hall
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

pdf bib
Gazpacho and summer rash: lexical relationships from temporal patterns of web search queries
Enrique Alfonseca | Massimiliano Ciaramita | Keith Hall
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

pdf bib
Large-scale Semantic Networks: Annotation and Evaluation
Václav Novák | Sven Hartrumpf | Keith Hall
Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009)

2008

pdf bib
Inter-sentential Coreferences in Semantic Networks: An Evaluation of Manual Annotation
Václav Novák | Keith Hall
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present an evaluation of inter-sentential coreference annotation in the context of manually created semantic networks. The semantic networks are constructed independently be each annotator and require an entity mapping priori to evaluating the coreference. We introduce a model used for mapping the semantic entities as well as an algorithm used for our evaluation task. Finally, we report the raw statistics for inter-annotator agreement and describe the inherent difficulty in evaluating coreference in semantic networks.

2007

pdf bib
K-best Spanning Tree Parsing
Keith Hall
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

pdf bib
Generation in Machine Translation from Deep Syntactic Trees
Keith Hall | Petr Němec
Proceedings of SSST, NAACL-HLT 2007 / AMTA Workshop on Syntax and Structure in Statistical Translation

pdf bib
Comparing Reordering Constraints for SMT Using Efficient BLEU Oracle Computation
Markus Dreyer | Keith Hall | Sanjeev Khudanpur
Proceedings of SSST, NAACL-HLT 2007 / AMTA Workshop on Syntax and Structure in Statistical Translation

pdf bib
Log-Linear Models of Non-Projective Trees, k-best MST Parsing and Tree-Ranking
Keith Hall | Jiří Havelka | David A. Smith
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

pdf bib
Corrective Models for Speech Recognition of Inflected Languages
Izhak Shafran | Keith Hall
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

2005

pdf bib
Corrective Modeling for Non-Projective Dependency Parsing
Keith Hall | Václav Novák
Proceedings of the Ninth International Workshop on Parsing Technology

2004

pdf bib
Attention Shifting for Parsing Speech
Keith B. Hall | Mark Johnson
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)