Hannaneh Hajishirzi


2022

pdf bib
Evidentiality-guided Generation for Knowledge-Intensive NLP Tasks
Akari Asai | Matt Gardner | Hannaneh Hajishirzi
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Retrieval-augmented generation models have shown state-of-the-art performance across many knowledge-intensive NLP tasks such as open-domain question answering and fact verification. These models are trained to generate a final output given retrieved passages that can be irrelevant to an input query, leading to learning spurious cues or memorization. This work introduces a method to incorporate evidentiality of passages—whether a passage contains correct evidence to support the output—into training the generator. We introduce a multi-task learning framework to jointly generate the final output and predict the evidentiality of each passage. Furthermore, we introduce a new task-agnostic method for obtaining high-quality silver evidentiality labels, addressing the issues of gold evidentiality labels being unavailable in most domains. Our experiments on five datasets across three knowledge-intensive tasks show that our new evidentiality-guided generator significantly outperforms its direct counterpart on all of them, and advances the state of the art on three of them. Our analysis shows that multi-task learning and silver evidentiality mining play key roles. Our code is available at https://github.com/AkariAsai/evidentiality_qa

pdf bib
MetaICL: Learning to Learn In Context
Sewon Min | Mike Lewis | Luke Zettlemoyer | Hannaneh Hajishirzi
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We introduce MetaICL (Meta-training for In-Context Learning), a new meta-training framework for few-shot learning where a pretrained language model is tuned to do in-context learning on a large set of training tasks. This meta-training enables the model to more effectively learn a new task in context at test time, by simply conditioning on a few training examples with no parameter updates or task-specific templates. We experiment on a large, diverse collection of tasks consisting of 142 NLP datasets including classification, question answering, natural language inference, paraphrase detection and more, across seven different meta-training/target splits. MetaICL outperforms a range of baselines including in-context learning without meta-training and multi-task learning followed by zero-shot transfer. We find that the gains are particularly significant for target tasks that have domain shifts from the meta-training tasks, and that using a diverse set of the meta-training tasks is key to improvements. We also show that MetaICL approaches (and sometimes beats) the performance of models fully finetuned on the target task training data, and outperforms much bigger models with nearly 8x parameters.

pdf bib
Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts
Daniel Khashabi | Xinxi Lyu | Sewon Min | Lianhui Qin | Kyle Richardson | Sean Welleck | Hannaneh Hajishirzi | Tushar Khot | Ashish Sabharwal | Sameer Singh | Yejin Choi
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Fine-tuning continuous prompts for target tasks has recently emerged as a compact alternative to full model fine-tuning. Motivated by these promising results, we investigate the feasibility of extracting a discrete (textual) interpretation of continuous prompts that is faithful to the problem they solve. In practice, we observe a “wayward” behavior between the task solved by continuous prompts and their nearest neighbor discrete projections: We can find continuous prompts that solve a task while being projected to an arbitrary text (e.g., definition of a different or even a contradictory task), while being within a very small (2%) margin of the best continuous prompt of the same size for the task. We provide intuitions behind this odd and surprising behavior, as well as extensive empirical analyses quantifying the effect of various parameters. For instance, for larger model sizes we observe higher waywardness, i.e, we can find prompts that more closely map to any arbitrary text with a smaller drop in accuracy. These findings have important implications relating to the difficulty of faithfully interpreting continuous prompts and their generalization across models and tasks, providing guidance for future progress in prompting language models.

pdf bib
Aligning to Social Norms and Values in Interactive Narratives
Prithviraj Ammanabrolu | Liwei Jiang | Maarten Sap | Hannaneh Hajishirzi | Yejin Choi
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We focus on creating agents that act in alignment with socially beneficial norms and values in interactive narratives or text-based games—environments wherein an agent perceives and interacts with a world through natural language. Such interactive agents are often trained via reinforcement learning to optimize task performance, even when such rewards may lead to agent behaviors that violate societal norms—causing harm either to the agent itself or other entities in the environment. Social value alignment refers to creating agents whose behaviors conform to expected moral and social norms for a given context and group of people—in our case, it means agents that behave in a manner that is less harmful and more beneficial for themselves and others.We build on the Jiminy Cricket benchmark (Hendrycks et al. 2021), a set of 25 annotated interactive narratives containing thousands of morally salient scenarios covering everything from theft and bodily harm to altruism. We introduce the GALAD (Game-value ALignment through Action Distillation) agent that uses the social commonsense knowledge present in specially trained language models to contextually restrict its action space to only those actions that are aligned with socially beneficial values. An experimental study shows that the GALAD agent makes decisions efficiently enough to improve state-of-the-art task performance by 4% while reducing the frequency of socially harmful behaviors by 25% compared to strong contemporary value alignment approaches.

pdf bib
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations
Hannaneh Hajishirzi | Qiang Ning | Avi Sil
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations

pdf bib
Retrieval Data Augmentation Informed by Downstream Question Answering Performance
James Ferguson | Hannaneh Hajishirzi | Pradeep Dasigi | Tushar Khot
Proceedings of the Fifth Fact Extraction and VERification Workshop (FEVER)

Training retrieval models to fetch contexts for Question Answering (QA) over large corpora requires labeling relevant passages in those corpora. Since obtaining exhaustive manual annotations of all relevant passages is not feasible, prior work uses text overlap heuristics to find passages that are likely to contain the answer, but this is not feasible when the task requires deeper reasoning and answers are not extractable spans (e.g.: multi-hop, discrete reasoning). We address this issue by identifying relevant passages based on whether they are useful for a trained QA model to arrive at the correct answers, and develop a search process guided by the QA model’s loss. Our experiments show that this approach enables identifying relevant context for unseen data greater than 90% of the time on the IIRC dataset and generalizes better to the end QA task than those trained on just the gold retrieval data on IIRC and QASC datasets.

pdf bib
Generated Knowledge Prompting for Commonsense Reasoning
Jiacheng Liu | Alisa Liu | Ximing Lu | Sean Welleck | Peter West | Ronan Le Bras | Yejin Choi | Hannaneh Hajishirzi
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

It remains an open question whether incorporating external knowledge benefits commonsense reasoning while maintaining the flexibility of pretrained sequence models. To investigate this question, we develop generated knowledge prompting, which consists of generating knowledge from a language model, then providing the knowledge as additional input when answering a question. Our method does not require task-specific supervision for knowledge integration, or access to a structured knowledge base, yet it improves performance of large-scale, state-of-the-art models on four commonsense reasoning tasks, achieving state-of-the-art results on numerical commonsense (NumerSense), general commonsense (CommonsenseQA 2.0), and scientific commonsense (QASC) benchmarks. Generated knowledge prompting highlights large-scale language models as flexible sources of external knowledge for improving commonsense reasoning.Our code is available at github.com/liujch1998/GKP

pdf bib
Cross-Task Generalization via Natural Language Crowdsourcing Instructions
Swaroop Mishra | Daniel Khashabi | Chitta Baral | Hannaneh Hajishirzi
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Humans (e.g., crowdworkers) have a remarkable ability in solving different tasks, by simply reading textual instructions that define them and looking at a few examples. Despite the success of the conventional supervised learning on individual datasets, such models often struggle with generalization across tasks (e.g., a question-answering system cannot solve classification tasks). A long-standing challenge in AI is to build a model that learns a new task by understanding the human-readable instructions that define it. To study this, we introduce NATURAL INSTRUCTIONS, a dataset of 61 distinct tasks, their human-authored instructions, and 193k task instances (input-output pairs). The instructions are obtained from crowdsourcing instructions used to create existing NLP datasets and mapped to a unified schema. Using this meta-dataset, we measure cross-task generalization by training models on seen tasks and measuring generalization to the remaining unseen ones. We adopt generative pre-trained language models to encode task-specific instructions along with input and generate task output. Our results indicate that models benefit from instructions when evaluated in terms of generalization to unseen tasks (19% better for models utilizing instructions). These models, however, are far behind an estimated performance upperbound indicating significant room for more progress in this direction.

pdf bib
FaVIQ: FAct Verification from Information-seeking Questions
Jungsoo Park | Sewon Min | Jaewoo Kang | Luke Zettlemoyer | Hannaneh Hajishirzi
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite significant interest in developing general purpose fact checking models, it is challenging to construct a large-scale fact verification dataset with realistic real-world claims. Existing claims are either authored by crowdworkers, thereby introducing subtle biases thatare difficult to control for, or manually verified by professional fact checkers, causing them to be expensive and limited in scale. In this paper, we construct a large-scale challenging fact verification dataset called FAVIQ, consisting of 188k claims derived from an existing corpus of ambiguous information-seeking questions. The ambiguities in the questions enable automatically constructing true and false claims that reflect user confusions (e.g., the year of the movie being filmed vs. being released). Claims in FAVIQ are verified to be natural, contain little lexical bias, and require a complete understanding of the evidence for verification. Our experiments show that the state-of-the-art models are far from solving our new task. Moreover, training on our data helps in professional fact-checking, outperforming models trained on the widely used dataset FEVER or in-domain data by up to 17% absolute. Altogether, our data will serve as a challenging benchmark for natural language understanding and support future progress in professional fact checking.

pdf bib
Noisy Channel Language Model Prompting for Few-Shot Text Classification
Sewon Min | Mike Lewis | Hannaneh Hajishirzi | Luke Zettlemoyer
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce a noisy channel approach for language model prompting in few-shot text classification. Instead of computing the likelihood of the label given the input (referred as direct models), channel models compute the conditional probability of the input given the label, and are thereby required to explain every word in the input. We use channel models for recently proposed few-shot learning methods with no or very limited updates to the language model parameters, via either in-context demonstration or prompt tuning. Our experiments show that, for both methods, channel models significantly outperform their direct counterparts, which we attribute to their stability, i.e., lower variance and higher worst-case accuracy. We also present extensive ablations that provide recommendations for when to use channel prompt tuning instead of other competitive models (e.g., direct head tuning): channel prompt tuning is preferred when the number of training examples is small, labels in the training data are imbalanced, or generalization to unseen labels is required.

pdf bib
Reframing Instructional Prompts to GPTk’s Language
Swaroop Mishra | Daniel Khashabi | Chitta Baral | Yejin Choi | Hannaneh Hajishirzi
Findings of the Association for Computational Linguistics: ACL 2022

What kinds of instructional prompts are easier to follow for Language Models (LMs)? We study this question by conducting extensive empirical analysis that shed light on important features of successful instructional prompts. Specifically, we study several classes of reframing techniques for manual reformulation of prompts into more effective ones. Some examples include decomposing a complex task instruction into multiple simpler tasks or itemizing instructions into sequential steps. Our experiments compare the zero-shot and few-shot performance of LMs prompted with reframed instructions on 12 NLP tasks across 6 categories. Compared with original instructions, our reframed instructions lead to significant improvements across LMs with different sizes. For example, the same reframed prompts boost few-shot performance of GPT3-series and GPT2-series by 12.5% and 6.7% respectively averaged over all tasks. Furthermore, reframed instructions reduce the number of examples required to prompt LMs in the few-shot setting. We hope these empirically-driven techniques will pave the way towards more effective future prompting algorithms.

pdf bib
MultiVerS: Improving scientific claim verification with weak supervision and full-document context
David Wadden | Kyle Lo | Lucy Wang | Arman Cohan | Iz Beltagy | Hannaneh Hajishirzi
Findings of the Association for Computational Linguistics: NAACL 2022

The scientific claim verification task requires an NLP system to label scientific documents which Support or Refute an input claim, and to select evidentiary sentences (or rationales) justifying each predicted label. In this work, we present MultiVerS, which predicts a fact-checking label and identifies rationales in a multitask fashion based on a shared encoding of the claim and full document context. This approach accomplishes two key modeling goals. First, it ensures that all relevant contextual information is incorporated into each labeling decision. Second, it enables the model to learn from instances annotated with a document-level fact-checking label, but lacking sentence-level rationales. This allows MultiVerS to perform weakly-supervised domain adaptation by training on scientific documents labeled using high-precision heuristics. Our approach outperforms two competitive baselines on three scientific claim verification datasets, with particularly strong performance in zero / few-shot domain adaptation experiments. Our code and data are available at https://github.com/dwadden/multivers.

2021

pdf bib
Efficient Passage Retrieval with Hashing for Open-domain Question Answering
Ikuya Yamada | Akari Asai | Hannaneh Hajishirzi
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Most state-of-the-art open-domain question answering systems use a neural retrieval model to encode passages into continuous vectors and extract them from a knowledge source. However, such retrieval models often require large memory to run because of the massive size of their passage index. In this paper, we introduce Binary Passage Retriever (BPR), a memory-efficient neural retrieval model that integrates a learning-to-hash technique into the state-of-the-art Dense Passage Retriever (DPR) to represent the passage index using compact binary codes rather than continuous vectors. BPR is trained with a multi-task objective over two tasks: efficient candidate generation based on binary codes and accurate reranking based on continuous vectors. Compared with DPR, BPR substantially reduces the memory cost from 65GB to 2GB without a loss of accuracy on two standard open-domain question answering benchmarks: Natural Questions and TriviaQA. Our code and trained models are available at https://github.com/studio-ousia/bpr.

pdf bib
Prompting Contrastive Explanations for Commonsense Reasoning Tasks
Bhargavi Paranjape | Julian Michael | Marjan Ghazvininejad | Hannaneh Hajishirzi | Luke Zettlemoyer
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
GooAQ: Open Question Answering with Diverse Answer Types
Daniel Khashabi | Amos Ng | Tushar Khot | Ashish Sabharwal | Hannaneh Hajishirzi | Chris Callison-Burch
Findings of the Association for Computational Linguistics: EMNLP 2021

While day-to-day questions come with a variety of answer types, the current question-answering (QA) literature has failed to adequately address the answer diversity of questions. To this end, we present GooAQ, a large-scale dataset with a variety of answer types. This dataset contains over 5 million questions and 3 million answers collected from Google. GooAQ questions are collected semi-automatically from the Google search engine using its autocomplete feature. This results in naturalistic questions of practical interest that are nonetheless short and expressed using simple language. GooAQ answers are mined from Google’s responses to our collected questions, specifically from the answer boxes in the search results. This yields a rich space of answer types, containing both textual answers (short and long) as well as more structured ones such as collections. We benchmark T5 models on GooAQ and observe that: (a) in line with recent work, LM’s strong performance on GooAQ’s short-answer questions heavily benefit from annotated data; however, (b) their quality in generating coherent and accurate responses for questions requiring long responses (such as ‘how’ and ‘why’ questions) is less reliant on observing annotated data and mainly supported by their pre-training. We release GooAQ to facilitate further research on improving QA with diverse response types.

pdf bib
Probing Across Time: What Does RoBERTa Know and When?
Zeyu Liu | Yizhong Wang | Jungo Kasai | Hannaneh Hajishirzi | Noah A. Smith
Findings of the Association for Computational Linguistics: EMNLP 2021

Models of language trained on very large corpora have been demonstrated useful for natural language processing. As fixed artifacts, they have become the object of intense study, with many researchers “probing” the extent to which they acquire and readily demonstrate linguistic abstractions, factual and commonsense knowledge, and reasoning abilities. Recent work applied several probes to intermediate training stages to observe the developmental process of a large-scale model (Chiang et al., 2020). Following this effort, we systematically answer a question: for various types of knowledge a language model learns, when during (pre)training are they acquired? Using RoBERTa as a case study, we find: linguistic knowledge is acquired fast, stably, and robustly across domains. Facts and commonsense are slower and more domain-sensitive. Reasoning abilities are, in general, not stably acquired. As new datasets, pretraining protocols, and probes emerge, we believe that probing-across-time analyses can help researchers understand the complex, intermingled learning that these models undergo and guide us toward more efficient approaches that accomplish necessary learning faster.

pdf bib
XOR QA: Cross-lingual Open-Retrieval Question Answering
Akari Asai | Jungo Kasai | Jonathan Clark | Kenton Lee | Eunsol Choi | Hannaneh Hajishirzi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Multilingual question answering tasks typically assume that answers exist in the same language as the question. Yet in practice, many languages face both information scarcity—where languages have few reference articles—and information asymmetry—where questions reference concepts from other cultures. This work extends open-retrieval question answering to a cross-lingual setting enabling questions from one language to be answered via answer content from another language. We construct a large-scale dataset built on 40K information-seeking questions across 7 diverse non-English languages that TyDi QA could not find same-language answers for. Based on this dataset, we introduce a task framework, called Cross-lingual Open-Retrieval Question Answering (XOR QA), that consists of three new tasks involving cross-lingual document retrieval from multilingual and English resources. We establish baselines with state-of-the-art machine translation systems and cross-lingual pretrained models. Experimental results suggest that XOR QA is a challenging task that will facilitate the development of novel techniques for multilingual question answering. Our data and code are available at https://nlp.cs.washington.edu/xorqa/.

pdf bib
Extracting a Knowledge Base of Mechanisms from COVID-19 Papers
Tom Hope | Aida Amini | David Wadden | Madeleine van Zuylen | Sravanthi Parasa | Eric Horvitz | Daniel Weld | Roy Schwartz | Hannaneh Hajishirzi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The COVID-19 pandemic has spawned a diverse body of scientific literature that is challenging to navigate, stimulating interest in automated tools to help find useful knowledge. We pursue the construction of a knowledge base (KB) of mechanisms—a fundamental concept across the sciences, which encompasses activities, functions and causal relations, ranging from cellular processes to economic impacts. We extract this information from the natural language of scientific papers by developing a broad, unified schema that strikes a balance between relevance and breadth. We annotate a dataset of mechanisms with our schema and train a model to extract mechanism relations from papers. Our experiments demonstrate the utility of our KB in supporting interdisciplinary scientific search over COVID-19 literature, outperforming the prominent PubMed search in a study with clinical experts. Our search engine, dataset and code are publicly available.

pdf bib
Probing Contextual Language Models for Common Ground with Visual Representations
Gabriel Ilharco | Rowan Zellers | Ali Farhadi | Hannaneh Hajishirzi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The success of large-scale contextual language models has attracted great interest in probing what is encoded in their representations. In this work, we consider a new question: to what extent contextual representations of concrete nouns are aligned with corresponding visual representations? We design a probing model that evaluates how effective are text-only representations in distinguishing between matching and non-matching visual representations. Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories. Moreover, they are effective in retrieving specific instances of image patches; textual context plays an important role in this process. Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans. We hope our analyses inspire future research in understanding and improving the visual capabilities of language models.

pdf bib
Beyond Paragraphs: NLP for Long Sequences
Iz Beltagy | Arman Cohan | Hannaneh Hajishirzi | Sewon Min | Matthew E. Peters
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorials

In this tutorial, we aim at bringing interested NLP researchers up to speed about the recent and ongoing techniques for document-level representation learning. Additionally, our goal is to reveal new research opportunities to the audience, which will hopefully bring us closer to address existing challenges in this domain.

pdf bib
DIALKI: Knowledge Identification in Conversational Systems through Dialogue-Document Contextualization
Zeqiu Wu | Bo-Ru Lu | Hannaneh Hajishirzi | Mari Ostendorf
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Identifying relevant knowledge to be used in conversational systems that are grounded in long documents is critical to effective response generation. We introduce a knowledge identification model that leverages the document structure to provide dialogue-contextualized passage encodings and better locate knowledge relevant to the conversation. An auxiliary loss captures the history of dialogue-document connections. We demonstrate the effectiveness of our model on two document-grounded conversational datasets and provide analyses showing generalization to unseen documents and long dialogue contexts.

pdf bib
Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text
Christopher Clark | Jordi Salvador | Dustin Schwenk | Derrick Bonafilia | Mark Yatskar | Eric Kolve | Alvaro Herrasti | Jonghyun Choi | Sachin Mehta | Sam Skjonsberg | Carissa Schoenick | Aaron Sarnat | Hannaneh Hajishirzi | Aniruddha Kembhavi | Oren Etzioni | Ali Farhadi
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Communicating with humans is challenging for AIs because it requires a shared understanding of the world, complex semantics (e.g., metaphors or analogies), and at times multi-modal gestures (e.g., pointing with a finger, or an arrow in a diagram). We investigate these challenges in the context of Iconary, a collaborative game of drawing and guessing based on Pictionary, that poses a novel challenge for the research community. In Iconary, a Guesser tries to identify a phrase that a Drawer is drawing by composing icons, and the Drawer iteratively revises the drawing to help the Guesser in response. This back-and-forth often uses canonical scenes, visual metaphor, or icon compositions to express challenging words, making it an ideal test for mixing language and visual/symbolic communication in AI. We propose models to play Iconary and train them on over 55,000 games between human players. Our models are skillful players and are able to employ world knowledge in language models to play with words unseen during training.

pdf bib
Joint Passage Ranking for Diverse Multi-Answer Retrieval
Sewon Min | Kenton Lee | Ming-Wei Chang | Kristina Toutanova | Hannaneh Hajishirzi
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We study multi-answer retrieval, an under-explored problem that requires retrieving passages to cover multiple distinct answers for a given question. This task requires joint modeling of retrieved passages, as models should not repeatedly retrieve passages containing the same answer at the cost of missing a different valid answer. Prior work focusing on single-answer retrieval is limited as it cannot reason about the set of passages jointly. In this paper, we introduce JPR, a joint passage retrieval model focusing on reranking. To model the joint probability of the retrieved passages, JPR makes use of an autoregressive reranker that selects a sequence of passages, equipped with novel training and decoding algorithms. Compared to prior approaches, JPR achieves significantly better answer coverage on three multi-answer datasets. When combined with downstream question answering, the improved retrieval enables larger answer generation models since they need to consider fewer passages, establishing a new state-of-the-art.

2020

pdf bib
IIRC: A Dataset of Incomplete Information Reading Comprehension Questions
James Ferguson | Matt Gardner | Hannaneh Hajishirzi | Tushar Khot | Pradeep Dasigi
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Humans often have to read multiple documents to address their information needs. However, most existing reading comprehension (RC) tasks only focus on questions for which the contexts provide all the information required to answer them, thus not evaluating a system’s performance at identifying a potential lack of sufficient information and locating sources for that information. To fill this gap, we present a dataset, IIRC, with more than 13K questions over paragraphs from English Wikipedia that provide only partial information to answer them, with the missing information occurring in one or more linked documents. The questions were written by crowd workers who did not have access to any of the linked documents, leading to questions that have little lexical overlap with the contexts where the answers appear. This process also gave many questions without answers, and those that require discrete reasoning, increasing the difficulty of the task. We follow recent modeling work on various reading comprehension datasets to construct a baseline model for this dataset, finding that it achieves 31.1% F1 on this task, while estimated human performance is 88.4%. The dataset, code for the baseline system, and a leaderboard can be found at https://allennlp.org/iirc.

pdf bib
An Information Bottleneck Approach for Controlling Conciseness in Rationale Extraction
Bhargavi Paranjape | Mandar Joshi | John Thickstun | Hannaneh Hajishirzi | Luke Zettlemoyer
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Decisions of complex models for language understanding can be explained by limiting the inputs they are provided to a relevant subsequence of the original text — a rationale. Models that condition predictions on a concise rationale, while being more interpretable, tend to be less accurate than models that are able to use the entire context. In this paper, we show that it is possible to better manage the trade-off between concise explanations and high task accuracy by optimizing a bound on the Information Bottleneck (IB) objective. Our approach jointly learns an explainer that predicts sparse binary masks over input sentences without explicit supervision, and an end-task predictor that considers only the residual sentences. Using IB, we derive a learning objective that allows direct control of mask sparsity levels through a tunable sparse prior. Experiments on the ERASER benchmark demonstrate significant gains over previous work for both task performance and agreement with human rationales. Furthermore, we find that in the semi-supervised setting, a modest amount of gold rationales (25% of training examples with gold masks) can close the performance gap with a model that uses the full input.

pdf bib
AmbigQA: Answering Ambiguous Open-domain Questions
Sewon Min | Julian Michael | Hannaneh Hajishirzi | Luke Zettlemoyer
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Ambiguity is inherent to open-domain question answering; especially when exploring new topics, it can be difficult to ask questions that have a single, unambiguous answer. In this paper, we introduce AmbigQA, a new open-domain question answering task which involves finding every plausible answer, and then rewriting the question for each one to resolve the ambiguity. To study this task, we construct AmbigNQ, a dataset covering 14,042 questions from NQ-open, an existing open-domain QA benchmark. We find that over half of the questions in NQ-open are ambiguous, with diverse sources of ambiguity such as event and entity references. We also present strong baseline models for AmbigQA which we show benefit from weakly supervised learning that incorporates NQ-open, strongly suggesting our new task and data will support significant future research effort. Our data and baselines are available at https://nlp.cs.washington.edu/ambigqa.

pdf bib
Fact or Fiction: Verifying Scientific Claims
David Wadden | Shanchuan Lin | Kyle Lo | Lucy Lu Wang | Madeleine van Zuylen | Arman Cohan | Hannaneh Hajishirzi
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We introduce scientific claim verification, a new task to select abstracts from the research literature containing evidence that SUPPORTS or REFUTES a given scientific claim, and to identify rationales justifying each decision. To study this task, we construct SciFact, a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts annotated with labels and rationales. We develop baseline models for SciFact, and demonstrate that simple domain adaptation techniques substantially improve performance compared to models trained on Wikipedia or political news. We show that our system is able to verify claims related to COVID-19 by identifying evidence from the CORD-19 corpus. Our experiments indicate that SciFact will provide a challenging testbed for the development of new systems designed to retrieve and reason over corpora containing specialized domain knowledge. Data and code for this new task are publicly available at https://github.com/allenai/scifact. A leaderboard and COVID-19 fact-checking demo are available at https://scifact.apps.allenai.org.

pdf bib
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
Jaemin Cho | Jiasen Lu | Dustin Schwenk | Hannaneh Hajishirzi | Aniruddha Kembhavi
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Mirroring the success of masked language models, vision-and-language counterparts like VILBERT, LXMERT and UNITER have achieved state of the art performance on a variety of multimodal discriminative tasks like visual question answering and visual grounding. Recent work has also successfully adapted such models towards the generative task of image captioning. This begs the question: Can these models go the other way and generate images from pieces of text? Our analysis of a popular representative from this model family – LXMERT – finds that it is unable to generate rich and semantically meaningful imagery with its current training setup. We introduce X-LXMERT, an extension to LXMERT with training refinements including: discretizing visual representations, using uniform masking with a large range of masking ratios and aligning the right pre-training datasets to the right objectives which enables it to paint. X-LXMERT’s image generation capabilities rival state of the art generative models while its question answering and captioning abilities remains comparable to LXMERT. Finally, we demonstrate the generality of these training refinements by adding image generation capabilities into UNITER to produce X-UNITER.

pdf bib
Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics
Swabha Swayamdipta | Roy Schwartz | Nicholas Lourie | Yizhong Wang | Hannaneh Hajishirzi | Noah A. Smith | Yejin Choi
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Large datasets have become commonplace in NLP research. However, the increased emphasis on data quantity has made it challenging to assess the quality of data. We introduce Data Maps—a model-based tool to characterize and diagnose datasets. We leverage a largely ignored source of information: the behavior of the model on individual instances during training (training dynamics) for building data maps. This yields two intuitive measures for each example—the model’s confidence in the true class, and the variability of this confidence across epochs—obtained in a single run of training. Experiments on four datasets show that these model-dependent measures reveal three distinct regions in the data map, each with pronounced characteristics. First, our data maps show the presence of “ambiguous” regions with respect to the model, which contribute the most towards out-of-distribution generalization. Second, the most populous regions in the data are “easy to learn” for the model, and play an important role in model optimization. Finally, data maps uncover a region with instances that the model finds “hard to learn”; these often correspond to labeling errors. Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.

pdf bib
Contextualized Sparse Representations for Real-Time Open-Domain Question Answering
Jinhyuk Lee | Minjoon Seo | Hannaneh Hajishirzi | Jaewoo Kang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Open-domain question answering can be formulated as a phrase retrieval problem, in which we can expect huge scalability and speed benefit but often suffer from low accuracy due to the limitation of existing phrase representation models. In this paper, we aim to improve the quality of each phrase embedding by augmenting it with a contextualized sparse representation (Sparc). Unlike previous sparse vectors that are term-frequency-based (e.g., tf-idf) or directly learned (only few thousand dimensions), we leverage rectified self-attention to indirectly learn sparse vectors in n-gram vocabulary space. By augmenting the previous phrase retrieval model (Seo et al., 2019) with Sparc, we show 4%+ improvement in CuratedTREC and SQuAD-Open. Our CuratedTREC score is even better than the best known retrieve & read model with at least 45x faster inference speed.

pdf bib
Logic-Guided Data Augmentation and Regularization for Consistent Question Answering
Akari Asai | Hannaneh Hajishirzi
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Many natural language questions require qualitative, quantitative or logical comparisons between two entities or events. This paper addresses the problem of improving the accuracy and consistency of responses to comparison questions by integrating logic rules and neural models. Our method leverages logical and linguistic knowledge to augment labeled training data and then uses a consistency-based regularizer to train the model. Improving the global consistency of predictions, our approach achieves large improvements over previous methods in a variety of question answering (QA) tasks, including multiple-choice qualitative reasoning, cause-effect reasoning, and extractive machine reading comprehension. In particular, our method significantly improves the performance of RoBERTa-based models by 1-5% across datasets. We advance state of the art by around 5-8% on WIQA and QuaRel and reduce consistency violations by 58% on HotpotQA. We further demonstrate that our approach can learn effectively from limited data.

pdf bib
SciREX: A Challenge Dataset for Document-Level Information Extraction
Sarthak Jain | Madeleine van Zuylen | Hannaneh Hajishirzi | Iz Beltagy
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Extracting information from full documents is an important problem in many domains, but most previous work focus on identifying relationships within a sentence or a paragraph. It is challenging to create a large-scale information extraction (IE) dataset at the document level since it requires an understanding of the whole document to annotate entities and their document-level relationships that usually span beyond sentences or even sections. In this paper, we introduce SciREX, a document level IE dataset that encompasses multiple IE tasks, including salient entity identification and document level N-ary relation identification from scientific articles. We annotate our dataset by integrating automatic and human annotations, leveraging existing scientific knowledge resources. We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE. Analyzing the model performance shows a significant gap between human performance and current baselines, inviting the community to use our dataset as a challenge to develop document-level IE models. Our data and code are publicly available at https://github.com/allenai/SciREX .

pdf bib
ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured Webpages
Colin Lockard | Prashant Shiralkar | Xin Luna Dong | Hannaneh Hajishirzi
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In many documents, such as semi-structured webpages, textual semantics are augmented with additional information conveyed using visual elements including layout, font size, and color. Prior work on information extraction from semi-structured websites has required learning an extraction model specific to a given template via either manually labeled or distantly supervised data from that template. In this work, we propose a solution for “zero-shot” open-domain relation extraction from webpages with a previously unseen template, including from websites with little overlap with existing sources of knowledge for distant supervision and websites in entirely new subject verticals. Our model uses a graph neural network-based approach to build a rich representation of text fields on a webpage and the relationships between them, enabling generalization to new templates. Experiments show this approach provides a 31% F1 gain over a baseline for zero-shot extraction in a new subject vertical.

pdf bib
Multi-modal Information Extraction from Text, Semi-structured, and Tabular Data on the Web
Xin Luna Dong | Hannaneh Hajishirzi | Colin Lockard | Prashant Shiralkar
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

The World Wide Web contains vast quantities of textual information in several forms: unstructured text, template-based semi-structured webpages (which present data in key-value pairs and lists), and tables. Methods for extracting information from these sources and converting it to a structured form have been a target of research from the natural language processing (NLP), data mining, and database communities. While these researchers have largely separated extraction from web data into different problems based on the modality of the data, they have faced similar problems such as learning with limited labeled data, defining (or avoiding defining) ontologies, making use of prior knowledge, and scaling solutions to deal with the size of the Web. In this tutorial we take a holistic view toward information extraction, exploring the commonalities in the challenges and solutions developed to address these different forms of text. We will explore the approaches targeted at unstructured text that largely rely on learning syntactic or semantic textual patterns, approaches targeted at semi-structured documents that learn to identify structural patterns in the template, and approaches targeting web tables which rely heavily on entity linking and type information. While these different data modalities have largely been considered separately in the past, recent research has started taking a more inclusive approach toward textual extraction, in which the multiple signals offered by textual, layout, and visual clues are combined into a single extraction model made possible by new deep learning approaches. At the same time, trends within purely textual extraction have shifted toward full-document understanding rather than considering sentences as independent units. With this in mind, it is worth considering the information extraction problem as a whole to motivate solutions that harness textual semantics along with visual and semi-structured layout information. We will discuss these approaches and suggest avenues for future work.

pdf bib
Proceedings of the 5th Workshop on Representation Learning for NLP
Spandana Gella | Johannes Welbl | Marek Rei | Fabio Petroni | Patrick Lewis | Emma Strubell | Minjoon Seo | Hannaneh Hajishirzi
Proceedings of the 5th Workshop on Representation Learning for NLP

pdf bib
Evaluating Models’ Local Decision Boundaries via Contrast Sets
Matt Gardner | Yoav Artzi | Victoria Basmov | Jonathan Berant | Ben Bogin | Sihao Chen | Pradeep Dasigi | Dheeru Dua | Yanai Elazar | Ananth Gottumukkala | Nitish Gupta | Hannaneh Hajishirzi | Gabriel Ilharco | Daniel Khashabi | Kevin Lin | Jiangming Liu | Nelson F. Liu | Phoebe Mulcaire | Qiang Ning | Sameer Singh | Noah A. Smith | Sanjay Subramanian | Reut Tsarfaty | Eric Wallace | Ally Zhang | Ben Zhou
Findings of the Association for Computational Linguistics: EMNLP 2020

Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture the abilities a dataset is intended to test. We propose a more rigorous annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model’s decision boundary, which can be used to more accurately evaluate a model’s true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, and IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets—up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.

pdf bib
UNIFIEDQA: Crossing Format Boundaries with a Single QA System
Daniel Khashabi | Sewon Min | Tushar Khot | Ashish Sabharwal | Oyvind Tafjord | Peter Clark | Hannaneh Hajishirzi
Findings of the Association for Computational Linguistics: EMNLP 2020

Question answering (QA) tasks have been posed using a variety of formats, such as extractive span selection, multiple choice, etc. This has led to format-specialized models, and even to an implicit division in the QA community. We argue that such boundaries are artificial and perhaps unnecessary, given the reasoning abilities we seek to teach are not governed by the format. As evidence, we use the latest advances in language modeling to build a single pre-trained QA model, UNIFIEDQA, that performs well across 19 QA datasets spanning 4 diverse formats. UNIFIEDQA performs on par with 8 different models that were trained on individual datasets themselves. Even when faced with 12 unseen datasets of observed formats, UNIFIEDQA performs surprisingly well, showing strong generalization from its outof-format training data. Finally, simply finetuning this pre trained QA model into specialized models results in a new state of the art on 10 factoid and commonsense question answering datasets, establishing UNIFIEDQA as a strong starting point for building QA systems.

pdf bib
MedICaT: A Dataset of Medical Images, Captions, and Textual References
Sanjay Subramanian | Lucy Lu Wang | Ben Bogin | Sachin Mehta | Madeleine van Zuylen | Sravanthi Parasa | Sameer Singh | Matt Gardner | Hannaneh Hajishirzi
Findings of the Association for Computational Linguistics: EMNLP 2020

Understanding the relationship between figures and text is key to scientific document understanding. Medical figures in particular are quite complex, often consisting of several subfigures (75% of figures in our dataset), with detailed text describing their content. Previous work studying figures in scientific papers focused on classifying figure content rather than understanding how images relate to the text. To address challenges in figure retrieval and figure-to-text alignment, we introduce MedICaT, a dataset of medical images in context. MedICaT consists of 217K images from 131K open access biomedical papers, and includes captions, inline references for 74% of figures, and manually annotated subfigures and subcaptions for a subset of figures. Using MedICaT, we introduce the task of subfigure to subcaption alignment in compound figures and demonstrate the utility of inline references in image-text matching. Our data and code can be accessed at https://github.com/allenai/medicat.

2019

pdf bib
Text Generation from Knowledge Graphs with Graph Transformers
Rik Koncel-Kedziorski | Dhanush Bekal | Yi Luan | Mirella Lapata | Hannaneh Hajishirzi
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Generating texts which express complex ideas spanning multiple sentences requires a structured representation of their content (document plan), but these representations are prohibitively expensive to manually produce. In this work, we address the problem of generating coherent multi-sentence texts from the output of an information extraction system, and in particular a knowledge graph. Graphical knowledge representations are ubiquitous in computing, but pose a significant challenge for text generation techniques due to their non-hierarchical nature, collapsing of long-distance dependencies, and structural variety. We introduce a novel graph transforming encoder which can leverage the relational structure of such knowledge graphs without imposing linearization or hierarchical constraints. Incorporated into an encoder-decoder setup, we provide an end-to-end trainable system for graph-to-text generation that we apply to the domain of scientific text. Automatic and human evaluations show that our technique produces more informative texts which exhibit better document structure than competitive encoder-decoder methods.

pdf bib
MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms
Aida Amini | Saadia Gabriel | Shanchuan Lin | Rik Koncel-Kedziorski | Yejin Choi | Hannaneh Hajishirzi
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We introduce a large-scale dataset of math word problems and an interpretable neural math problem solver by learning to map problems to their operation programs. Due to annotation challenges, current datasets in this domain have been either relatively small in scale or did not offer precise operational annotations over diverse problem types. We introduce a new representation language to model operation programs corresponding to each math problem that aim to improve both the performance and the interpretability of the learned models. Using this representation language, we significantly enhance the AQUA-RAT dataset with fully-specified operational programs. We additionally introduce a neural sequence-to-program model with automatic problem categorization. Our experiments show improvements over competitive baselines in our dataset as well as the AQUA-RAT dataset. The results are still lower than human performance indicating that the dataset poses new challenges for future research. Our dataset is available at https://math-qa.github.io/math-QA/

pdf bib
A general framework for information extraction using dynamic span graphs
Yi Luan | Dave Wadden | Luheng He | Amy Shah | Mari Ostendorf | Hannaneh Hajishirzi
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We introduce a general framework for several information extraction tasks that share span representations using dynamically constructed span graphs. The graphs are dynamically constructed by selecting the most confident entity spans and linking these nodes with confidence-weighted relation types and coreferences. The dynamic span graph allow coreference and relation type confidences to propagate through the graph to iteratively refine the span representations. This is unlike previous multi-task frameworks for information extraction in which the only interaction between tasks is in the shared first-layer LSTM. Our framework significantly outperforms state-of-the-art on multiple information extraction tasks across multiple datasets reflecting different domains. We further observe that the span enumeration approach is good at detecting nested span entities, with significant F1 score improvement on the ACE dataset.

pdf bib
SemEval-2019 Task 10: Math Question Answering
Mark Hopkins | Ronan Le Bras | Cristian Petrescu-Prahova | Gabriel Stanovsky | Hannaneh Hajishirzi | Rik Koncel-Kedziorski
Proceedings of the 13th International Workshop on Semantic Evaluation

We report on the SemEval 2019 task on math question answering. We provided a question set derived from Math SAT practice exams, including 2778 training questions and 1082 test questions. For a significant subset of these questions, we also provided SMT-LIB logical form annotations and an interpreter that could solve these logical forms. Systems were evaluated based on the percentage of correctly answered questions. The top system correctly answered 45% of the test questions, a considerable improvement over the 17% random guessing baseline.

pdf bib
A Discrete Hard EM Approach for Weakly Supervised Question Answering
Sewon Min | Danqi Chen | Hannaneh Hajishirzi | Luke Zettlemoyer
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Many question answering (QA) tasks only provide weak supervision for how the answer should be computed. For example, TriviaQA answers are entities that can be mentioned multiple times in supporting documents, while DROP answers can be computed by deriving many different equations from numbers in the reference text. In this paper, we show it is possible to convert such tasks into discrete latent variable learning problems with a precomputed, task-specific set of possible solutions (e.g. different mentions or equations) that contains one correct option. We then develop a hard EM learning scheme that computes gradients relative to the most likely solution at each update. Despite its simplicity, we show that this approach significantly outperforms previous methods on six QA tasks, including absolute gains of 2–10%, and achieves the state-of-the-art on five of them. Using hard updates instead of maximizing marginal likelihood is key to these results as it encourages the model to find the one correct answer, which we show through detailed qualitative analysis.

pdf bib
Mixture Content Selection for Diverse Sequence Generation
Jaemin Cho | Minjoon Seo | Hannaneh Hajishirzi
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Generating diverse sequences is important in many NLP applications such as question generation or summarization that exhibit semantically one-to-many relationships between source and the target sequences. We present a method to explicitly separate diversification from generation using a general plug-and-play module (called SELECTOR) that wraps around and guides an existing encoder-decoder model. The diversification stage uses a mixture of experts to sample different binary masks on the source sequence for diverse content selection. The generation stage uses a standard encoder-decoder model given each selected content from the source sequence. Due to the non-differentiable nature of discrete sampling and the lack of ground truth labels for binary mask, we leverage a proxy for ground truth mask and adopt stochastic hard-EM for training. In question generation (SQuAD) and abstractive summarization (CNN-DM), our method demonstrates significant improvements in accuracy, diversity and training efficiency, including state-of-the-art top-1 accuracy in both datasets, 6% gain in top-5 accuracy, and 3.7 times faster training over a state-of-the-art model. Our code is publicly available at https://github.com/clovaai/FocusSeq2Seq.

pdf bib
Entity, Relation, and Event Extraction with Contextualized Span Representations
David Wadden | Ulme Wennberg | Yi Luan | Hannaneh Hajishirzi
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We examine the capabilities of a unified, multi-task framework for three information extraction tasks: named entity recognition, relation extraction, and event extraction. Our framework (called DyGIE++) accomplishes all tasks by enumerating, refining, and scoring text spans designed to capture local (within-sentence) and global (cross-sentence) context. Our framework achieves state-of-the-art results across all tasks, on four datasets from a variety of domains. We perform experiments comparing different techniques to construct span representations. Contextualized embeddings like BERT perform well at capturing relationships among entities in the same or adjacent sentences, while dynamic span graph updates model long-range cross-sentence relationships. For instance, propagating span representations via predicted coreference links can enable the model to disambiguate challenging entity mentions. Our code is publicly available at https://github.com/dwadden/dygiepp and can be easily adapted for new tasks or datasets.

pdf bib
On Making Reading Comprehension More Comprehensive
Matt Gardner | Jonathan Berant | Hannaneh Hajishirzi | Alon Talmor | Sewon Min
Proceedings of the 2nd Workshop on Machine Reading for Question Answering

Machine reading comprehension, the task of evaluating a machine’s ability to comprehend a passage of text, has seen a surge in popularity in recent years. There are many datasets that are targeted at reading comprehension, and many systems that perform as well as humans on some of these datasets. Despite all of this interest, there is no work that systematically defines what reading comprehension is. In this work, we justify a question answering approach to reading comprehension and describe the various kinds of questions one might use to more fully test a system’s comprehension of a passage, moving beyond questions that only probe local predicate-argument structures. The main pitfall of this approach is that questions can easily have surface cues or other biases that allow a model to shortcut the intended reasoning process. We discuss ways proposed in current literature to mitigate these shortcuts, and we conclude with recommendations for future dataset collection efforts.

pdf bib
Compositional Questions Do Not Necessitate Multi-hop Reasoning
Sewon Min | Eric Wallace | Sameer Singh | Matt Gardner | Hannaneh Hajishirzi | Luke Zettlemoyer
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Multi-hop reading comprehension (RC) questions are challenging because they require reading and reasoning over multiple paragraphs. We argue that it can be difficult to construct large multi-hop RC datasets. For example, even highly compositional questions can be answered with a single hop if they target specific entity types, or the facts needed to answer them are redundant. Our analysis is centered on HotpotQA, where we show that single-hop reasoning can solve much more of the dataset than previously thought. We introduce a single-hop BERT-based RC model that achieves 67 F1—comparable to state-of-the-art multi-hop models. We also design an evaluation setting where humans are not shown all of the necessary paragraphs for the intended multi-hop reasoning but can still answer over 80% of questions. Together with detailed error analysis, these results suggest there should be an increasing focus on the role of evidence in multi-hop reasoning and possibly even a shift towards information retrieval style evaluations with large and diverse evidence collections.

pdf bib
Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index
Minjoon Seo | Jinhyuk Lee | Tom Kwiatkowski | Ankur Parikh | Ali Farhadi | Hannaneh Hajishirzi
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Existing open-domain question answering (QA) models are not suitable for real-time usage because they need to process several long documents on-demand for every input query, which is computationally prohibitive. In this paper, we introduce query-agnostic indexable representations of document phrases that can drastically speed up open-domain QA. In particular, our dense-sparse phrase encoding effectively captures syntactic, semantic, and lexical information of the phrases and eliminates the pipeline filtering of context documents. Leveraging strategies for optimizing training and inference time, our model can be trained and deployed even in a single 4-GPU server. Moreover, by representing phrases as pointers to their start and end tokens, our model indexes phrases in the entire English Wikipedia (up to 60 billion phrases) using under 2TB. Our experiments on SQuAD-Open show that our model is on par with or more accurate than previous models with 6000x reduced computational cost, which translates into at least 68x faster end-to-end inference benchmark on CPUs. Code and demo are available at nlp.cs.washington.edu/denspi

pdf bib
Multi-hop Reading Comprehension through Question Decomposition and Rescoring
Sewon Min | Victor Zhong | Luke Zettlemoyer | Hannaneh Hajishirzi
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Multi-hop Reading Comprehension (RC) requires reasoning and aggregation across several paragraphs. We propose a system for multi-hop RC that decomposes a compositional question into simpler sub-questions that can be answered by off-the-shelf single-hop RC models. Since annotations for such decomposition are expensive, we recast subquestion generation as a span prediction problem and show that our method, trained using only 400 labeled examples, generates sub-questions that are as effective as human-authored sub-questions. We also introduce a new global rescoring approach that considers each decomposition (i.e. the sub-questions and their answers) to select the best final answer, greatly improving overall performance. Our experiments on HotpotQA show that this approach achieves the state-of-the-art results, while providing explainable evidence for its decision making in the form of sub-questions.

2018

pdf bib
The UWNLP system at SemEval-2018 Task 7: Neural Relation Extraction Model with Selectively Incorporated Concept Embeddings
Yi Luan | Mari Ostendorf | Hannaneh Hajishirzi
Proceedings of The 12th International Workshop on Semantic Evaluation

This paper describes our submission for SemEval 2018 Task 7 shared task on semantic relation extraction and classification in scientific papers. Our model is based on the end-to-end relation extraction model of (Miwa and Bansal, 2016) with several enhancements such as character-level encoding attention mechanism on selecting pretrained concept candidate embeddings. Our official submission ranked the second in relation classification task (Subtask 1.1 and Subtask 2 Senerio 2), and the first in the relation extraction task (Subtask 2 Scenario 1).

pdf bib
Semi-Supervised Event Extraction with Paraphrase Clusters
James Ferguson | Colin Lockard | Daniel Weld | Hannaneh Hajishirzi
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Supervised event extraction systems are limited in their accuracy due to the lack of available training data. We present a method for self-training event extraction systems by bootstrapping additional training data. This is done by taking advantage of the occurrence of multiple mentions of the same event instances across newswire articles from multiple sources. If our system can make a high-confidence extraction of some mentions in such a cluster, it can then acquire diverse training examples by adding the other mentions as well. Our experiments show significant performance improvements on multiple event extractors over ACE 2005 and TAC-KBP 2015 datasets.

pdf bib
Phrase-Indexed Question Answering: A New Challenge for Scalable Document Comprehension
Minjoon Seo | Tom Kwiatkowski | Ankur Parikh | Ali Farhadi | Hannaneh Hajishirzi
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We formalize a new modular variant of current question answering tasks by enforcing complete independence of the document encoder from the question encoder. This formulation addresses a key challenge in machine comprehension by building a standalone representation of the document discourse. It additionally leads to a significant scalability advantage since the encoding of the answer candidate phrases in the document can be pre-computed and indexed offline for efficient retrieval. We experiment with baseline models for the new task, which achieve a reasonable accuracy but significantly underperform unconstrained QA models. We invite the QA research community to engage in Phrase-Indexed Question Answering (PIQA, pika) for closing the gap. The leaderboard is at: nlp.cs.washington.edu/piqa

pdf bib
Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction
Yi Luan | Luheng He | Mari Ostendorf | Hannaneh Hajishirzi
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We introduce a multi-task setup of identifying entities, relations, and coreference clusters in scientific articles. We create SciERC, a dataset that includes annotations for all three tasks and develop a unified framework called SciIE with shared span representations. The multi-task setup reduces cascading errors between tasks and leverages cross-sentence relations through coreference links. Experiments show that our multi-task model outperforms previous models in scientific information extraction without using any domain-specific features. We further show that the framework supports construction of a scientific knowledge graph, which we use to analyze information in scientific literature.

pdf bib
Pyramidal Recurrent Unit for Language Modeling
Sachin Mehta | Rik Koncel-Kedziorski | Mohammad Rastegari | Hannaneh Hajishirzi
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

LSTMs are powerful tools for modeling contextual information, as evidenced by their success at the task of language modeling. However, modeling contexts in very high dimensional space can lead to poor generalizability. We introduce the Pyramidal Recurrent Unit (PRU), which enables learning representations in high dimensional space with more generalization power and fewer parameters. PRUs replace the linear transformation in LSTMs with more sophisticated interactions such as pyramidal or grouped linear transformations. This architecture gives strong results on word-level language modeling while reducing parameters significantly. In particular, PRU improves the perplexity of a recent state-of-the-art language model by up to 1.3 points while learning 15-20% fewer parameters. For similar number of model parameters, PRU outperforms all previous RNN models that exploit different gating mechanisms and transformations. We provide a detailed examination of the PRU and its behavior on the language modeling tasks. Our code is open-source and available at https://sacmehta.github.io/PRU/.

bib
Standardized Tests as benchmarks for Artificial Intelligence
Mrinmaya Sachan | Minjoon Seo | Hannaneh Hajishirzi | Eric Xing
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

Standardized tests have recently been proposed as replacements to the Turing test as a driver for progress in AI (Clark, 2015). These include tests on understanding passages and stories and answering questions about them (Richardson et al., 2013; Rajpurkar et al., 2016a, inter alia), science question answering (Schoenick et al., 2016, inter alia), algebra word problems (Kushman et al., 2014, inter alia), geometry problems (Seo et al., 2015; Sachan et al., 2016), visual question answering (Antol et al., 2015), etc. Many of these tests require sophisticated understanding of the world, aiming to push the boundaries of AI. For this tutorial, we broadly categorize these tests into two categories: open domain tests such as reading comprehensions and elementary school tests where the goal is to find the support for an answer from the student curriculum, and closed domain tests such as intermediate level math and science tests (algebra, geometry, Newtonian physics problems, etc.). Unlike open domain tests, closed domain tests require the system to have significant domain knowledge and reasoning capabilities. For example, geometry questions typically involve a number of geometry primitives (lines, quadrilaterals, circles, etc) and require students to use axioms and theorems of geometry (Pythagoras theorem, alternating angles, etc) to solve them. These closed domains often have a formal logical basis and the question can be mapped to a formal language by semantic parsing. The formal question representation can then provided as an input to an expert system to solve the question.

2017

pdf bib
Scientific Information Extraction with Semi-supervised Neural Tagging
Yi Luan | Mari Ostendorf | Hannaneh Hajishirzi
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

This paper addresses the problem of extracting keyphrases from scientific articles and categorizing them as corresponding to a task, process, or material. We cast the problem as sequence tagging and introduce semi-supervised methods to a neural tagging model, which builds on recent advances in named entity recognition. Since annotated training data is scarce in this domain, we introduce a graph-based semi-supervised algorithm together with a data selection scheme to leverage unannotated articles. Both inductive and transductive semi-supervised learning strategies outperform state-of-the-art information extraction performance on the 2017 SemEval Task 10 ScienceIE task.

pdf bib
Question Answering through Transfer Learning from Large Fine-grained Supervision Data
Sewon Min | Minjoon Seo | Hannaneh Hajishirzi
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We show that the task of question answering (QA) can significantly benefit from the transfer learning of models trained on a different large, fine-grained QA dataset. We achieve the state of the art in two well-studied QA datasets, WikiQA and SemEval-2016 (Task 3A), through a basic transfer learning technique from SQuAD. For WikiQA, our model outperforms the previous best model by more than 8%. We demonstrate that finer supervision provides better guidance for learning lexical and syntactic information than coarser supervision, through quantitative results and visual analysis. We also show that a similar transfer learning procedure achieves the state of the art on an entailment task.

2016

pdf bib
Learning Prototypical Event Structure from Photo Albums
Antoine Bosselut | Jianfu Chen | David Warren | Hannaneh Hajishirzi | Yejin Choi
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Multiplicative Representations for Unsupervised Semantic Role Induction
Yi Luan | Yangfeng Ji | Hannaneh Hajishirzi | Boyang Li
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
A Theme-Rewriting Approach for Generating Algebra Word Problems
Rik Koncel-Kedziorski | Ioannis Konstas | Luke Zettlemoyer | Hannaneh Hajishirzi
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
MAWPS: A Math Word Problem Repository
Rik Koncel-Kedziorski | Subhro Roy | Aida Amini | Nate Kushman | Hannaneh Hajishirzi
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2015

pdf bib
Solving Geometry Problems: Combining Text and Diagram Interpretation
Minjoon Seo | Hannaneh Hajishirzi | Ali Farhadi | Oren Etzioni | Clint Malcolm
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Talking to the crowd: What do people react to in online discussions?
Aaron Jaech | Victoria Zayats | Hao Fang | Mari Ostendorf | Hannaneh Hajishirzi
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Aligning Sentences from Standard Wikipedia to Simple Wikipedia
William Hwang | Hannaneh Hajishirzi | Mari Ostendorf | Wei Wu
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Learning Knowledge Graphs for Question Answering through Conversational Dialog
Ben Hixon | Peter Clark | Hannaneh Hajishirzi
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Unediting: Detecting Disfluencies Without Careful Transcripts
Victoria Zayats | Mari Ostendorf | Hannaneh Hajishirzi
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Parsing Algebraic Word Problems into Equations
Rik Koncel-Kedziorski | Hannaneh Hajishirzi | Ashish Sabharwal | Oren Etzioni | Siena Dumas Ang
Transactions of the Association for Computational Linguistics, Volume 3

This paper formalizes the problem of solving multi-sentence algebraic word problems as that of generating and scoring equation trees. We use integer linear programming to generate equation trees and score their likelihood by learning local and global discriminative models. These models are trained on a small set of word problems and their answers, without any manual annotation, in order to choose the equation that best matches the problem text. We refer to the overall system as Alges. We compare Alges with previous work and show that it covers the full gamut of arithmetic operations whereas Hosseini et al. (2014) only handle addition and subtraction. In addition, Alges overcomes the brittleness of the Kushman et al. (2014) approach on single-equation problems, yielding a 15% to 50% reduction in error.

2014

pdf bib
Multi-Resolution Language Grounding with Weak Supervision
R. Koncel-Kedziorski | Hannaneh Hajishirzi | Ali Farhadi
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib
Learning to Solve Arithmetic Word Problems with Verb Categorization
Mohammad Javad Hosseini | Hannaneh Hajishirzi | Oren Etzioni | Nate Kushman
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

pdf bib
Joint Coreference Resolution and Named-Entity Linking with Multi-Pass Sieves
Hannaneh Hajishirzi | Leila Zilles | Daniel S. Weld | Luke Zettlemoyer
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

2012

pdf bib
Using Group History to Identify Character-Directed Utterances in Multi-Child Interactions
Hannaneh Hajishirzi | Jill F. Lehman | Jessica K. Hodgins
Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Search
Co-authors