William Cohen

Also published as: William W. Cohen


2024

SEMQA: Semi-Extractive Multi-Source Question Answering
Tal Schuster | Adam Lelkes | Haitian Sun | Jai Gupta | Jonathan Berant | William Cohen | Donald Metzler
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Recently proposed long-form question answering (QA) systems, supported by large language models (LLMs), have shown promising capabilities. Yet, attributing and verifying their generated abstractive answers can be difficult, and automatically evaluating their accuracy remains an ongoing challenge. In this work, we introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion. Specifically, Semi-extractive Multi-source QA (SEMQA) requires models to output a comprehensive answer, while mixing factual quoted spans (copied verbatim from given input sources) and non-factual free-text connectors that glue these spans together into a single cohesive passage. This setting bridges the gap between the outputs of well-grounded but constrained extractive QA systems and more fluent but harder to attribute fully abstractive answers. In particular, it enables a new mode for language models that leverages their advanced language generation capabilities, while also producing fine in-line attributions by design that are easy to verify, interpret, and evaluate. To study this task, we create the first dataset of this kind, QuoteSum, with human-written semi-extractive answers to natural and generated questions, and define text-based evaluation metrics. Experimenting with several LLMs in various settings, we find this task to be surprisingly challenging, demonstrating the importance of QuoteSum for developing and studying such consolidation capabilities.

MEMORY-VQ: Compression for Tractable Internet-Scale Memory
Yury Zemlyanskiy | Michiel de Jong | Luke Vilnis | Santiago Ontanon | William Cohen | Sumit Sanghai | Joshua Ainslie
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Retrieval augmentation is a powerful but expensive method to make language models more knowledgeable about the world. Memory-based methods like LUMEN (de Jong et al., 2023a) pre-compute token representations for retrieved passages to drastically speed up inference. However, memory also leads to much greater storage requirements from storing pre-computed representations. We propose MEMORY-VQ, a new method to reduce storage requirements of memory-augmented models without sacrificing performance. Our method uses a vector quantization variational autoencoder (VQ-VAE) to compress token representations. We apply MEMORY-VQ to the LUMEN model to obtain LUMEN-VQ, a memory model that achieves a 16x compression rate with comparable performance on the KILT benchmark. LUMEN-VQ enables practical retrieval augmentation even for extremely large retrieval corpora.
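
As a rough illustration of the storage saving that vector quantization offers, the sketch below replaces each stored token vector with the index of its nearest codebook entry. It is a minimal example under stated assumptions, not the LUMEN-VQ implementation: the abstract describes a VQ-VAE with a learned codebook, whereas the codebook here is hand-picked, and all names (`quantize`, `compress`, `decompress`) are hypothetical.

```python
# Minimal sketch of codebook-based vector quantization.
# Assumption: the codebook is fixed by hand; a VQ-VAE would learn it.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(vector, codebook):
    """Return the index of the codebook entry closest to `vector`."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))

def compress(vectors, codebook):
    """Store one small integer code per vector instead of the vector itself."""
    return [quantize(v, codebook) for v in vectors]

def decompress(codes, codebook):
    """Approximate the original vectors by looking up their codes."""
    return [codebook[c] for c in codes]

codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
token_vectors = [[0.9, 1.2], [0.1, -0.2], [-0.8, 0.7]]

codes = compress(token_vectors, codebook)
restored = decompress(codes, codebook)
print(codes)  # [1, 0, 2]
```

Only the integer codes and the shared codebook need to be stored, which is where the compression comes from; the reconstruction is lossy.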

2023

Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval
John Wieting | Jonathan Clark | William Cohen | Graham Neubig | Taylor Berg-Kirkpatrick
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Contrastive learning has been successfully used for retrieval of semantically aligned sentences, but it often requires large batch sizes or careful engineering to work well. In this paper, we instead propose a generative model for learning multilingual text embeddings which can be used to retrieve or score sentence pairs. Our model operates on parallel data in N languages and, through an approximation we introduce, efficiently encourages source separation in this multilingual setting, separating semantic information that is shared between translations from stylistic or language-specific variation. We show careful large-scale comparisons between contrastive and generation-based approaches for learning multilingual text embeddings, a comparison that has not been done to the best of our knowledge despite the popularity of these approaches. We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval - the last of which we introduce in this paper. Overall, our model outperforms both a strong contrastive and generative baseline on these tasks.

FiDO: Fusion-in-Decoder optimized for stronger performance and faster inference
Michiel de Jong | Yury Zemlyanskiy | Joshua Ainslie | Nicholas FitzGerald | Sumit Sanghai | Fei Sha | William Cohen
Findings of the Association for Computational Linguistics: ACL 2023

Fusion-in-Decoder (FiD) is a powerful retrieval-augmented language model that sets the state-of-the-art on many knowledge-intensive NLP tasks. However, the architecture used for FiD was chosen by making minimal modifications to a standard T5 model, which our analysis shows to be highly suboptimal for a retrieval-augmented model. In particular, FiD allocates the bulk of FLOPs to the encoder, while the majority of inference time results from memory bandwidth constraints in the decoder. We propose two simple changes to the FiD architecture to alleviate memory bandwidth constraints, and speed up inference by 7x. This allows us to use a much larger decoder at modest cost. We denote FiD with the above modifications as FiDO, and show that it strongly improves performance over existing FiD models for a wide range of inference budgets. For example, FiDO-Large-XXL performs faster inference than FiD-Base and achieves better performance than FiD-Large.

WinoDict: Probing language models for in-context word acquisition
Julian Martin Eisenschlos | Jeremy R. Cole | Fangyu Liu | William W. Cohen
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

We introduce a new in-context learning paradigm to measure Large Language Models’ (LLMs) ability to learn novel words during inference. In particular, we rewrite Winograd-style co-reference resolution problems by replacing the key concept word with a synthetic but plausible word that the model must understand to complete the task. Solving this task requires the model to make use of the dictionary definition of the new word given in the prompt. This benchmark addresses word acquisition, one important aspect of the diachronic degradation known to afflict LLMs. As LLMs are frozen in time at the moment they are trained, they are normally unable to reflect the way language changes over time. We show that the accuracy of LLMs compared to the original Winograd tasks decreases radically in our benchmark, thus identifying a limitation of current models and providing a benchmark to measure future improvements in LLMs’ ability to do in-context learning.

Augmenting Pre-trained Language Models with QA-Memory for Open-Domain Question Answering
Wenhu Chen | Pat Verga | Michiel de Jong | John Wieting | William W. Cohen
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Existing state-of-the-art methods for open-domain question-answering (ODQA) use an open book approach in which information is first retrieved from a large text corpus or knowledge base (KB) and then reasoned over to produce an answer. A recent alternative is to retrieve from a collection of previously-generated question-answer pairs; this has several practical advantages including being more memory and compute-efficient. Question-answer pairs are also appealing in that they can be viewed as an intermediate between text and KB triples: like KB triples, they often concisely express a single relationship, but like text, have much higher coverage than traditional KBs. In this work, we describe a new QA system that augments a text-to-text model with a large memory of question-answer pairs, and a new pre-training task for the latent step of question retrieval. The pre-training task substantially simplifies training and greatly improves performance on smaller QA benchmarks. Unlike prior systems of this sort, our QA system can also answer multi-hop questions that do not explicitly appear in the collection of stored question-answer pairs.

2022

ConditionalQA: A Complex Reading Comprehension Dataset with Conditional Answers
Haitian Sun | William Cohen | Ruslan Salakhutdinov
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We describe a Question Answering (QA) dataset that contains complex questions with conditional answers, i.e., the answers are only applicable when certain conditions apply. We call this dataset ConditionalQA. In addition to conditional answers, the dataset also features: (1) long context documents with information that is related in logically complex ways; (2) multi-hop questions that require compositional logical reasoning; (3) a combination of extractive questions, yes/no questions, questions with multiple answers, and not-answerable questions; (4) questions asked without knowing the answers. We show that ConditionalQA is challenging for many of the existing QA models, especially in selecting answer conditions. We believe that this dataset will motivate further research in answering complex questions over long documents.

MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text
Wenhu Chen | Hexiang Hu | Xi Chen | Pat Verga | William Cohen
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

While language models store a massive amount of world knowledge implicitly in their parameters, even very large models often fail to encode information about rare entities and events, while incurring huge computational costs. Recently, retrieval-augmented models, such as REALM, RAG, and RETRO, have incorporated world knowledge into language generation by leveraging an external non-parametric index and have demonstrated impressive performance with constrained model sizes. However, these methods are restricted to retrieving only textual knowledge, neglecting the ubiquitous amount of knowledge in other modalities like images – much of which contains information not covered by any text. To address this limitation, we propose the first Multimodal Retrieval-Augmented Transformer (MuRAG), which accesses an external non-parametric multimodal memory to augment language generation. MuRAG is pre-trained with a mixture of large-scale image-text and text-only corpora using a joint contrastive and generative loss. We perform experiments on two different datasets that require retrieving and reasoning over both images and text to answer a given query: WebQA and MultimodalQA. Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20% absolute on both datasets and under both distractor and full-wiki settings.

Correcting Diverse Factual Errors in Abstractive Summarization via Post-Editing and Language Model Infilling
Vidhisha Balachandran | Hannaneh Hajishirzi | William Cohen | Yulia Tsvetkov
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Abstractive summarization models often generate inconsistent summaries containing factual errors or hallucinated content. Recent works focus on correcting factual errors in generated summaries via post-editing. Such correction models are trained using adversarial non-factual summaries constructed using heuristic rules for injecting errors. However, generating non-factual summaries using heuristics often does not generalize well to actual model errors. In this work, we propose to generate hard, representative synthetic examples of non-factual summaries through infilling language models. With this data, we train a more robust fact-correction model to post-edit the summaries to improve factual consistency. Through quantitative and qualitative experiments on two popular summarization datasets, CNN/DM and XSum, we show that our approach vastly outperforms prior methods in correcting erroneous summaries. Our model, FactEdit, improves factuality scores by over ~11 points on CNN/DM and over ~31 points on XSum on average across multiple summarization models, producing more factual summaries while maintaining competitive summarization quality.

Time-Aware Language Models as Temporal Knowledge Bases
Bhuwan Dhingra | Jeremy R. Cole | Julian Martin Eisenschlos | Daniel Gillick | Jacob Eisenstein | William W. Cohen
Transactions of the Association for Computational Linguistics, Volume 10

Many facts come with an expiration date, from the name of the President to the basketball team LeBron James plays for. However, most language models (LMs) are trained on snapshots of data collected at a specific moment in time. This can limit their utility, especially in the closed-book setting where the pretraining corpus must contain the facts the model should memorize. We introduce a diagnostic dataset aimed at probing LMs for factual knowledge that changes over time and highlight problems with LMs at either end of the spectrum: those trained on specific slices of temporal data, as well as those trained on a wide range of temporal data. To mitigate these problems, we propose a simple technique for jointly modeling text with its timestamp. This improves memorization of seen facts from the training time period, as well as calibration on predictions about unseen facts from future time periods. We also show that models trained with temporal context can be efficiently “refreshed” as new data arrives, without the need for retraining from scratch.
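
One simple way to model text jointly with its timestamp is to prepend the time to the model input as plain text. The sketch below illustrates that idea only; the exact prefix format and the function name `add_time_prefix` are assumptions made for this example and are not taken from the abstract.

```python
# Minimal sketch: attach a timestamp to each training or query text.
# Assumption: the "year: ... text: ..." template is illustrative only.

def add_time_prefix(text, year):
    """Return the input string with its year prepended as a textual prefix."""
    return f"year: {year} text: {text}"

example = add_time_prefix("The team signed a new point guard.", 2018)
print(example)  # year: 2018 text: The team signed a new point guard.
```

With inputs in this form, the same sentence can be paired with different years at query time, which is one way a model could be asked about facts that change over time.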

Evaluating Explanations: How Much Do Explanations from the Teacher Aid Students?
Danish Pruthi | Rachit Bansal | Bhuwan Dhingra | Livio Baldini Soares | Michael Collins | Zachary C. Lipton | Graham Neubig | William W. Cohen
Transactions of the Association for Computational Linguistics, Volume 10

While many methods purport to explain predictions by highlighting salient features, what aims these explanations serve and how they ought to be evaluated often go unstated. In this work, we introduce a framework to quantify the value of explanations via the accuracy gains that they confer on a student model trained to simulate a teacher model. Crucially, the explanations are available to the student during training, but are not available at test time. Compared with prior proposals, our approach is less easily gamed, enabling principled, automatic, model-agnostic evaluation of attributions. Using our framework, we compare numerous attribution methods for text classification and question answering, and observe quantitative differences that are consistent (to a moderate to high degree) across different student model architectures and learning strategies.

2021

Investigating the Effect of Background Knowledge on Natural Questions
Vidhisha Balachandran | Bhuwan Dhingra | Haitian Sun | Michael Collins | William Cohen
Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures

Existing work shows the benefits of integrating KBs with textual evidence for QA only on questions that are answerable by KBs alone (Sun et al., 2019). In contrast, real-world QA systems often have to deal with questions that might not be directly answerable by KBs. Here, we investigate the effect of integrating background knowledge from KBs for the Natural Questions (NQ) task. We create a subset of the NQ data, Factual Questions (FQ), where the questions have evidence in the KB in the form of paths that link question entities to answer entities but still must be answered using text, to facilitate further research into KB integration methods. We propose and analyze a simple, model-agnostic approach for incorporating KB paths into text-based QA systems and establish a strong upper bound on FQ for our method using an oracle retriever. We show that several variants of Personalized PageRank based fact retrievers lead to a low recall of answer entities and consequently fail to improve QA performance. Our results suggest that fact retrieval is a bottleneck for integrating KBs into real-world QA datasets.

Adaptable and Interpretable Neural Memory Over Symbolic Knowledge
Pat Verga | Haitian Sun | Livio Baldini Soares | William Cohen
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Past research has demonstrated that large neural language models (LMs) encode surprising amounts of factual information: however, augmenting or modifying this information requires modifying a corpus and retraining, which is computationally expensive. To address this problem, we develop a neural LM that includes an interpretable neuro-symbolic KB in the form of a “fact memory”. Each element of the fact memory is formed from a triple of vectors, where each vector corresponds to a KB entity or relation. Our LM improves performance on knowledge-intensive question-answering tasks, sometimes dramatically, including a 27 point increase in one setting of WebQuestionsSP over a state-of-the-art open-book model, despite using 5% of the parameters. Most interestingly, we demonstrate that the model can be modified, without any re-training, by updating the fact memory.

Differentiable Open-Ended Commonsense Reasoning
Bill Yuchen Lin | Haitian Sun | Bhuwan Dhingra | Manzil Zaheer | Xiang Ren | William Cohen
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Current commonsense reasoning research focuses on developing models that use commonsense knowledge to answer multiple-choice questions. However, systems designed to answer multiple-choice questions may not be useful in applications that do not provide a small list of candidate answers to choose from. As a step towards making commonsense reasoning research more realistic, we propose to study open-ended commonsense reasoning (OpenCSR) — the task of answering a commonsense question without any pre-defined choices — using as a resource only a corpus of commonsense facts written in natural language. OpenCSR is challenging due to a large decision space, and because many questions require implicit multi-hop reasoning. As an approach to OpenCSR, we propose DrFact, an efficient Differentiable model for multi-hop Reasoning over knowledge Facts. To evaluate OpenCSR methods, we adapt several popular commonsense reasoning benchmarks, and collect multiple new answers for each test question via crowd-sourcing. Experiments show that DrFact outperforms strong baseline methods by a large margin.

MATE: Multi-view Attention for Table Transformer Efficiency
Julian Eisenschlos | Maharshi Gor | Thomas Müller | William Cohen
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

This work presents a sparse-attention Transformer architecture for modeling documents that contain large tables. Tables are ubiquitous on the web, and are rich in information. However, more than 20% of relational tables on the web have 20 or more rows (Cafarella et al., 2008), and these large tables present a challenge for current Transformer models, which are typically limited to 512 tokens. Here we propose MATE, a novel Transformer architecture designed to model the structure of web tables. MATE uses sparse attention in a way that allows heads to efficiently attend to either rows or columns in a table. This architecture scales linearly with respect to speed and memory, and can handle documents containing more than 8000 tokens with current accelerators. MATE also has a more appropriate inductive bias for tabular data, and sets a new state-of-the-art for three table reasoning datasets. For HybridQA (Chen et al., 2020), a dataset that involves large documents containing tables, we improve the best prior result by 19 points.
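
The idea of heads that attend to either rows or columns can be pictured as two attention masks over the table's tokens. The sketch below is a simplified illustration under stated assumptions, not the MATE implementation: it builds dense boolean masks for a toy table with one token per cell, ignores any non-table tokens, and does not reproduce the efficient sparse computation that gives linear scaling. The function name `build_masks` is hypothetical.

```python
# Minimal sketch: which token pairs a "row head" or a "column head" may attend to.
# Assumption: every token carries the row and column index of its table cell.

def build_masks(rows, cols):
    """Return (row_mask, col_mask); mask[i][j] is True if token i may attend to token j."""
    n = len(rows)
    row_mask = [[rows[i] == rows[j] for j in range(n)] for i in range(n)]
    col_mask = [[cols[i] == cols[j] for j in range(n)] for i in range(n)]
    return row_mask, col_mask

# A 2x2 table flattened row by row, one token per cell.
rows = [0, 0, 1, 1]
cols = [0, 1, 0, 1]
row_mask, col_mask = build_masks(rows, cols)

print(row_mask[0])  # [True, True, False, False]
print(col_mask[0])  # [True, False, True, False]
```

A row head would apply `row_mask` and a column head `col_mask`, so each head only sees one slice of the table.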

2019

Handling Divergent Reference Texts when Evaluating Table-to-Text Generation
Bhuwan Dhingra | Manaal Faruqui | Ankur Parikh | Ming-Wei Chang | Dipanjan Das | William Cohen
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Automatically constructed datasets for generating text from semi-structured data (tables), such as WikiBio, often contain reference texts that diverge from the information in the corresponding semi-structured data. We show that metrics which rely solely on the reference texts, such as BLEU and ROUGE, show poor correlation with human judgments when those references diverge. We propose a new metric, PARENT, which aligns n-grams from the reference and generated texts to the semi-structured data before computing their precision and recall. Through a large-scale human evaluation study of table-to-text models for WikiBio, we show that PARENT correlates with human judgments better than existing text generation metrics. We also adapt and evaluate the information extraction based evaluation proposed by Wiseman et al. (2017), and show that PARENT has comparable correlation to it, while being easier to use. We show that PARENT is also applicable when the reference texts are elicited from humans using the data from the WebNLG challenge.

PullNet: Open Domain Question Answering with Iterative Retrieval on Knowledge Bases and Text
Haitian Sun | Tania Bedrax-Weiss | William Cohen
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We consider open-domain question answering (QA) where answers are drawn from either a corpus, a knowledge base (KB), or a combination of both of these. We focus on a setting in which a corpus is supplemented with a large but incomplete KB, and on questions that require non-trivial (e.g., “multi-hop”) reasoning. We describe PullNet, an integrated framework for (1) learning what to retrieve and (2) reasoning with this heterogeneous information to find the best answer. PullNet uses an iterative process to construct a question-specific subgraph that contains information relevant to the question. In each iteration, a graph convolutional network (graph CNN) is used to identify subgraph nodes that should be expanded using retrieval (or “pull”) operations on the corpus and/or KB. After the subgraph is complete, another graph CNN is used to extract the answer from the subgraph. This retrieve-and-reason process allows us to answer multi-hop questions using large KBs and corpora. PullNet is weakly supervised, requiring question-answer pairs but not gold inference paths. Experimentally, PullNet improves over the prior state-of-the-art, and in the setting where a corpus is used with an incomplete KB these improvements are often dramatic. PullNet is also often superior to prior systems in a KB-only setting or a text-only setting.

PubMedQA: A Dataset for Biomedical Research Question Answering
Qiao Jin | Bhuwan Dhingra | Zhengping Liu | William Cohen | Xinghua Lu
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We introduce PubMedQA, a novel biomedical question answering (QA) dataset collected from PubMed abstracts. The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?) using the corresponding abstracts. PubMedQA has 1k expert-annotated, 61.2k unlabeled and 211.3k artificially generated QA instances. Each PubMedQA instance is composed of (1) a question which is either an existing research article title or derived from one, (2) a context which is the corresponding abstract without its conclusion, (3) a long answer, which is the conclusion of the abstract and, presumably, answers the research question, and (4) a yes/no/maybe answer which summarizes the conclusion. PubMedQA is the first QA dataset where reasoning over biomedical research texts, especially their quantitative contents, is required to answer the questions. Our best performing model, multi-phase fine-tuning of BioBERT with long answer bag-of-word statistics as additional supervision, achieves 68.1% accuracy, compared to single human performance of 78.0% accuracy and majority-baseline of 55.2% accuracy, leaving much room for improvement. PubMedQA is publicly available at https://pubmedqa.github.io.
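
The four-part instance structure described above, and the majority baseline used as a reference point, can be sketched as follows. This is an illustrative sketch only: the class name, field names, placeholder strings, and toy labels are assumptions made for the example and do not reflect the released dataset's actual schema or contents.

```python
from collections import Counter
from dataclasses import dataclass

# Assumption: field names are hypothetical, chosen to mirror the four parts
# listed in the abstract (question, context, long answer, yes/no/maybe answer).
@dataclass
class PubMedQAInstance:
    question: str        # article title, or a question derived from it
    context: str         # abstract without its conclusion
    long_answer: str     # conclusion of the abstract
    final_decision: str  # "yes", "no", or "maybe"

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Placeholder instance; the strings stand in for real text.
instance = PubMedQAInstance(
    question="<research question>",
    context="<abstract without conclusion>",
    long_answer="<conclusion of the abstract>",
    final_decision="maybe",
)

# Toy labels, not real dataset statistics.
toy_labels = ["yes", "yes", "no", "maybe"]
label, accuracy = majority_baseline(toy_labels)
print(label, accuracy)  # yes 0.5
```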

Probing Biomedical Embeddings from Language Models
Qiao Jin | Bhuwan Dhingra | William Cohen | Xinghua Lu
Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP

Contextualized word embeddings derived from pre-trained language models (LMs) show significant improvements on downstream NLP tasks. Pre-training on domain-specific corpora, such as biomedical articles, further improves their performance. In this paper, we conduct probing experiments to determine what additional information is carried intrinsically by the in-domain trained contextualized embeddings. For this we use the pre-trained LMs as fixed feature extractors and restrict the downstream task models to not have additional sequence modeling layers. We compare BERT (Devlin et al., 2018), ELMo (Peters et al., 2018), BioBERT (Lee et al., 2019) and BioELMo, a biomedical version of ELMo trained on 10M PubMed abstracts. Surprisingly, while fine-tuned BioBERT is better than BioELMo in biomedical NER and NLI tasks, as a fixed feature extractor BioELMo outperforms BioBERT in our probing tasks. We use visualization and nearest neighbor analysis to show that better encoding of entity-type and relational information leads to this superiority.

2018

Neural Models for Reasoning over Multiple Mentions Using Coreference
Bhuwan Dhingra | Qiao Jin | Zhilin Yang | William Cohen | Ruslan Salakhutdinov
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Many problems in NLP require aggregating information from multiple mentions of the same entity which may be far apart in the text. Existing Recurrent Neural Network (RNN) layers are biased towards short-term dependencies and hence not suited to such tasks. We present a recurrent layer which is instead biased towards coreferent dependencies. The layer uses coreference annotations extracted from an external system to connect entity mentions belonging to the same cluster. Incorporating this layer into a state-of-the-art reading comprehension model improves performance on three datasets – WikiHop, LAMBADA and the bAbI AI tasks – with large gains when training data is scarce.

AttentionMeSH: Simple, Effective and Interpretable Automatic MeSH Indexer
Qiao Jin | Bhuwan Dhingra | William Cohen | Xinghua Lu
Proceedings of the 6th BioASQ Workshop A challenge on large-scale biomedical semantic indexing and question answering

There are millions of articles in the PubMed database. To facilitate information retrieval, curators in the National Library of Medicine (NLM) assign a set of Medical Subject Headings (MeSH) to each article. MeSH is a hierarchically-organized vocabulary, containing about 28K different concepts, covering fields from clinical medicine to information sciences. Several automatic MeSH indexing models have been developed to improve the time-consuming and financially expensive manual annotation, including the NLM official tool – Medical Text Indexer, and the winner of the BioASQ Task5a challenge – DeepMeSH. However, these models are complex and not interpretable. We propose a novel end-to-end model, AttentionMeSH, which utilizes deep learning and an attention mechanism to index MeSH terms to biomedical text. The attention mechanism enables the model to associate textual evidence with annotations, thus providing interpretability at the word level. The model also uses a novel masking mechanism to enhance accuracy and speed. In the final week of BioASQ Challenge Task6a, we ranked 2nd by average MiF using an on-construction model. After the contest, we achieve close to state-of-the-art MiF performance of ~0.684 using our final model. Human evaluations show AttentionMeSH also provides a high level of interpretability, retrieving about 90% of all expert-labeled relevant words given a MeSH-article pair at 20 output.

Learning to Define Terms in the Software Domain
Vidhisha Balachandran | Dheeraj Rajagopal | Rose Catherine Kanjirathinkal | William Cohen
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text

One way to test a person’s knowledge of a domain is to ask them to define domain-specific terms. Here, we investigate the task of automatically generating definitions of technical terms by reading text from the technical domain. Specifically, we learn definitions of software entities from a large corpus built from the user forum Stack Overflow. To model definitions, we train a language model and incorporate additional domain-specific information like word co-occurrence and ontological category information. Our approach improves previous baselines by 2 BLEU points for the definition generation task. Our experiments also show the additional challenges associated with the task and the shortcomings of language-model based architectures for definition generation.

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
Zhilin Yang | Peng Qi | Saizheng Zhang | Yoshua Bengio | William Cohen | Ruslan Salakhutdinov | Christopher D. Manning
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Existing question answering (QA) datasets fail to train QA systems to perform complex reasoning and provide explanations for answers. We introduce HotpotQA, a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems’ ability to extract relevant facts and perform necessary comparison. We show that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.

Open Domain Question Answering Using Early Fusion of Knowledge Bases and Text
Haitian Sun | Bhuwan Dhingra | Manzil Zaheer | Kathryn Mazaitis | Ruslan Salakhutdinov | William Cohen
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Open Domain Question Answering (QA) is evolving from complex pipelined systems to end-to-end deep neural networks. Specialized neural models have been developed for extracting answers from either text alone or Knowledge Bases (KBs) alone. In this paper we look at a more practical setting, namely QA over the combination of a KB and entity-linked text, which is appropriate when an incomplete KB is available with a large text corpus. Building on recent advances in graph representation learning we propose a novel model, GRAFT-Net, for extracting answers from a question-specific subgraph containing text and KB entities and relations. We construct a suite of benchmark tasks for this problem, varying the difficulty of questions, the amount of training data, and KB completeness. We show that GRAFT-Net is competitive with the state-of-the-art when tested using either KBs or text alone, and vastly outperforms existing methods in the combined setting.

2017

Semi-Supervised QA with Generative Domain-Adaptive Nets
Zhilin Yang | Junjie Hu | Ruslan Salakhutdinov | William Cohen
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We study the problem of semi-supervised question answering—utilizing unlabeled text to boost the performance of question answering models. We propose a novel training framework, the Generative Domain-Adaptive Nets. In this framework, we train a generative model to generate questions based on the unlabeled text, and combine model-generated questions with human-generated questions for training question answering models. We develop novel domain adaptation algorithms, based on reinforcement learning, to alleviate the discrepancy between the model-generated data distribution and the human-generated data distribution. Experiments show that our proposed framework obtains substantial improvement from unlabeled text.

Gated-Attention Readers for Text Comprehension
Bhuwan Dhingra | Hanxiao Liu | Zhilin Yang | William Cohen | Ruslan Salakhutdinov
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper we study the problem of answering cloze-style questions over documents. Our model, the Gated-Attention (GA) Reader, integrates a multi-hop architecture with a novel attention mechanism, which is based on multiplicative interactions between the query embedding and the intermediate states of a recurrent neural network document reader. This enables the reader to build query-specific representations of tokens in the document for accurate answer selection. The GA Reader obtains state-of-the-art results on three benchmarks for this task: the CNN & Daily Mail news stories and the Who Did What dataset. The effectiveness of multiplicative interaction is demonstrated by an ablation study and by comparison with alternative compositional operators for implementing gated attention.
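
The multiplicative interaction described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dot-product scoring, and the use of plain Python lists in place of recurrent network states are all assumptions made for illustration. Each document token attends over the query tokens, and the resulting query summary gates the token state by element-wise multiplication.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_attention(doc_states, query_states):
    """Gate each document token state by a token-specific query summary.

    doc_states:   list of document token vectors (hypothetical stand-ins
                  for the intermediate states of the document reader)
    query_states: list of query token vectors of the same dimension
    """
    gated = []
    for d in doc_states:
        # Dot-product score of this document token against every query token
        # (the scoring function is an assumption of this sketch).
        scores = [sum(di * qi for di, qi in zip(d, q)) for q in query_states]
        weights = softmax(scores)
        # Token-specific query summary: attention-weighted sum of query vectors.
        summary = [
            sum(w * q[k] for w, q in zip(weights, query_states))
            for k in range(len(d))
        ]
        # Multiplicative interaction: element-wise product of state and summary.
        gated.append([di * sk for di, sk in zip(d, summary)])
    return gated
```

In a multi-hop setting, a gating step like this would be applied between successive reader layers so that each hop refines the query-specific token representations. Swapping the element-wise product for addition or concatenation gives the kind of alternative compositional operator the abstract mentions.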

2016

Scalable Statistical Relational Learning for NLP
William Yang Wang | William Cohen
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorial Abstracts

Using Graphs of Classifiers to Impose Constraints on Semi-supervised Relation Extraction
Lidong Bing | William Cohen | Bhuwan Dhingra | Richard Wang
Proceedings of the 5th Workshop on Automated Knowledge Base Construction

Tweet2Vec: Character-Based Distributed Representations for Social Media
Bhuwan Dhingra | Zhong Zhou | Dylan Fitzpatrick | Michael Muehl | William Cohen
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2015

Improving Distant Supervision for Information Extraction Using Label Propagation Through Lists
Lidong Bing | Sneha Chaudhari | Richard Wang | William Cohen
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Learning to Identify the Best Contexts for Knowledge-based WSD
Evgenia Wasserman Pritsker | William Cohen | Einat Minkov
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Joint Information Extraction and Reasoning: A Scalable Statistical Relational Learning Approach
William Yang Wang | William W. Cohen
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Learning Relational Features with Backward Random Walks
Ni Lao | Einat Minkov | William Cohen
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations, and Facts
Dana Movshovitz-Attias | William W. Cohen
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

2014

Dependency Parsing for Weibo: An Efficient Probabilistic Logic Programming Approach
William Yang Wang | Lingpeng Kong | Kathryn Mazaitis | William W. Cohen
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

Natural Language Models for Predicting Programming Comments
Dana Movshovitz-Attias | William W. Cohen
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

What’s in a Domain? Multi-Domain Learning for Multi-Attribute Data
Mahesh Joshi | Mark Dredze | William W. Cohen | Carolyn P. Rosé
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2012

Crowdsourced Comprehension: Predicting Prerequisite Structure in Wikipedia
Partha Talukdar | William Cohen
Proceedings of the Seventh Workshop on Building Educational Applications Using NLP

Bootstrapping Biomedical Ontologies for Scientific Text using NELL
Dana Movshovitz-Attias | William W. Cohen
BioNLP: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing

Alignment-HMM-based Extraction of Abbreviations from Biomedical Text
Dana Movshovitz-Attias | William W. Cohen
BioNLP: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing

Evaluating Joint Modeling of Yeast Biology Literature and Protein-Protein Interaction Networks
Ramnath Balasubramanyan | Kathryn Rivard | William W. Cohen | Jelena Jakovljevic | John L. Woolford
BioNLP: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing

Collectively Representing Semi-Structured Data from the Web
Bhavana Dalvi | William Cohen | Jamie Callan
Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX)

Graph Based Similarity Measures for Synonym Extraction from Parsed Text
Einat Minkov | William Cohen
Workshop Proceedings of TextGraphs-7: Graph-based Methods for Natural Language Processing

Reading The Web with Learned Syntactic-Semantic Inference Rules
Ni Lao | Amarnag Subramanya | Fernando Pereira | William W. Cohen
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Multi-Domain Learning: When Do Domains Matter?
Mahesh Joshi | Mark Dredze | William W. Cohen | Carolyn Rosé
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

2011

What pushes their buttons? Predicting comment polarity from the content of political blog posts
Ramnath Balasubramanyan | William W. Cohen | Doug Pierce | David P. Redlawsk
Proceedings of the Workshop on Language in Social Media (LSM 2011)

Structured Databases of Named Entities from Bayesian Nonparametrics
Jacob Eisenstein | Tae Yano | William Cohen | Noah Smith | Eric Xing
Proceedings of the First workshop on Unsupervised Learning in NLP

Random Walk Inference and Learning in A Large Scale Knowledge Base
Ni Lao | Tom Mitchell | William W. Cohen
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

2009

Predicting Response to Political Blog Posts with Topic Models
Tae Yano | William W. Cohen | Noah A. Smith
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Character-level Analysis of Semi-Structured Documents for Set Expansion
Richard C. Wang | William W. Cohen
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

Automatic Set Instance Extraction using the Web
Richard C. Wang | William W. Cohen
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2008

Exploiting Feature Hierarchy for Transfer Learning in Named Entity Recognition
Andrew Arnold | Ramesh Nallapati | William W. Cohen
Proceedings of ACL-08: HLT

Learning Graph Walk Based Similarity Measures for Parsed Text
Einat Minkov | William W. Cohen
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

Automatic Set Expansion for List Question Answering
Richard C. Wang | Nico Schlaefer | William W. Cohen | Eric Nyberg
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2006

NER Systems that Suit User’s Preferences: Adjusting the Recall-Precision Trade-off for Entity Extraction
Einat Minkov | Richard Wang | Anthony Tomasic | William Cohen
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers

A Graph-Search Framework for GeneId Ranking
William Cohen
Proceedings of the HLT-NAACL BioNLP Workshop on Linking Natural Language and Biology

Improving “Email Speech Acts” Analysis via N-gram Selection
Vitor Carvalho | William Cohen
Proceedings of the Analyzing Conversations in Text and Speech

A Graphical Framework for Contextual Search and Name Disambiguation in Email
Einat Minkov | William Cohen | Andrew Ng
Proceedings of TextGraphs: the First Workshop on Graph Based Methods for Natural Language Processing

2005

Extracting Personal Names from Email: Applying Named Entity Recognition to Informal Text
Einat Minkov | Richard C. Wang | William W. Cohen
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

2004

Learning to Classify Email into “Speech Acts”
William W. Cohen | Vitor R. Carvalho | Tom M. Mitchell
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

2001

Issues in Extracting Information from the Web
William W. Cohen
Proceedings of the Seventh International Workshop on Parsing Technologies
