Ashish Sabharwal


2022

pdf bib
Hey AI, Can You Solve Complex Tasks by Talking to Agents?
Tushar Khot | Kyle Richardson | Daniel Khashabi | Ashish Sabharwal
Findings of the Association for Computational Linguistics: ACL 2022

Training giant models from scratch for each complex task is resource- and data-inefficient. To help develop models that can leverage existing systems, we propose a new challenge: Learning to solve complex tasks by communicating with existing agents (or models) in natural language. We design a synthetic benchmark, CommaQA, with three complex reasoning tasks (explicit, implicit, numeric) designed to be solved by communicating with existing QA agents. For instance, using text and table QA agents to answer questions such as “Who had the longest javelin throw from USA?”. We show that black-box models struggle to learn this task from scratch (accuracy under 50%) even with access to each agent’s knowledge and gold facts supervision. In contrast, models that learn to communicate with agents outperform black-box models, reaching scores of 100% when given gold decomposition supervision. However, we show that the challenge of learning to solve complex tasks by communicating with existing agents without relying on any auxiliary supervision or data still remains highly elusive. We will release CommaQA, along with a compositional generalization test split, to advance research in this direction.

2021

pdf bib
How much coffee was consumed during EMNLP 2019? Fermi Problems: A New Reasoning Challenge for AI
Ashwin Kalyan | Abhinav Kumar | Arjun Chandrasekaran | Ashish Sabharwal | Peter Clark
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Many real-world problems require the combined application of multiple reasoning abilities—employing suitable abstractions, commonsense knowledge, and creative synthesis of problem-solving strategies. To help advance AI systems towards such capabilities, we propose a new reasoning challenge, namely Fermi Problems (FPs), which are questions whose answers can only be approximately estimated because their precise computation is either impractical or impossible. For example, “How much would the sea level rise if all ice in the world melted?” FPs are commonly used in quizzes and interviews to bring out and evaluate the creative reasoning abilities of humans. To do the same for AI systems, we present two datasets: 1) A collection of 1k real-world FPs sourced from quizzes and olympiads; and 2) a bank of 10k synthetic FPs of intermediate complexity to serve as a sandbox for the harder real-world challenge. In addition to question-answer pairs, the datasets contain detailed solutions in the form of an executable program and supporting facts, helping in supervision and evaluation of intermediate steps. We demonstrate that even extensively fine-tuned large-scale language models perform poorly on these datasets, on average making estimates that are off by two orders of magnitude. Our contribution is thus the crystallization of several unsolved AI problems into a single, new challenge that we hope will spur further advances in building systems that can reason.

pdf bib
Ethical-Advice Taker: Do Language Models Understand Natural Language Interventions?
Jieyu Zhao | Daniel Khashabi | Tushar Khot | Ashish Sabharwal | Kai-Wei Chang
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
GooAQ: Open Question Answering with Diverse Answer Types
Daniel Khashabi | Amos Ng | Tushar Khot | Ashish Sabharwal | Hannaneh Hajishirzi | Chris Callison-Burch
Findings of the Association for Computational Linguistics: EMNLP 2021

While day-to-day questions come with a variety of answer types, the current question-answering (QA) literature has failed to adequately address the answer diversity of questions. To this end, we present GooAQ, a large-scale dataset with a variety of answer types. This dataset contains over 5 million questions and 3 million answers collected from Google. GooAQ questions are collected semi-automatically from the Google search engine using its autocomplete feature. This results in naturalistic questions of practical interest that are nonetheless short and expressed using simple language. GooAQ answers are mined from Google’s responses to our collected questions, specifically from the answer boxes in the search results. This yields a rich space of answer types, containing both textual answers (short and long) as well as more structured ones such as collections. We benchmark T5 models on GooAQ and observe that: (a) in line with recent work, LM’s strong performance on GooAQ’s short-answer questions heavily benefit from annotated data; however, (b) their quality in generating coherent and accurate responses for questions requiring long responses (such as ‘how’ and ‘why’ questions) is less reliant on observing annotated data and mainly supported by their pre-training. We release GooAQ to facilitate further research on improving QA with diverse response types.

pdf bib
Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models
Tushar Khot | Daniel Khashabi | Kyle Richardson | Peter Clark | Ashish Sabharwal
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We propose a general framework called Text Modular Networks(TMNs) for building interpretable systems that learn to solve complex tasks by decomposing them into simpler ones solvable by existing models. To ensure solvability of simpler tasks, TMNs learn the textual input-output behavior (i.e., language) of existing models through their datasets. This differs from prior decomposition-based approaches which, besides being designed specifically for each complex task, produce decompositions independent of existing sub-models. Specifically, we focus on Question Answering (QA) and show how to train a next-question generator to sequentially produce sub-questions targeting appropriate sub-models, without additional human annotation. These sub-questions and answers provide a faithful natural language explanation of the model’s reasoning. We use this framework to build ModularQA, a system that can answer multi-hop reasoning questions by decomposing them into sub-questions answerable by a neural factoid single-span QA model and a symbolic calculator. Our experiments show that ModularQA is more versatile than existing explainable systems for DROP and HotpotQA datasets, is more robust than state-of-the-art blackbox (uninterpretable) systems, and generates more understandable and trustworthy explanations compared to prior work.

pdf bib
Temporal Reasoning on Implicit Events from Distant Supervision
Ben Zhou | Kyle Richardson | Qiang Ning | Tushar Khot | Ashish Sabharwal | Dan Roth
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We propose TRACIE, a novel temporal reasoning dataset that evaluates the degree to which systems understand implicit events—events that are not mentioned explicitly in natural language text but can be inferred from it. This introduces a new challenge in temporal reasoning research, where prior work has focused on explicitly mentioned events. Human readers can infer implicit events via commonsense reasoning, resulting in a more comprehensive understanding of the situation and, consequently, better reasoning about time. We find, however, that state-of-the-art models struggle when predicting temporal relationships between implicit and explicit events. To address this, we propose a neuro-symbolic temporal reasoning model, SymTime, which exploits distant supervision signals from large-scale text and uses temporal rules to combine start times and durations to infer end times. SymTime outperforms strong baseline systems on TRACIE by 5%, and by 11% in a zero prior knowledge training setting. Our approach also generalizes to other temporal reasoning tasks, as evidenced by a gain of 1%-9% on MATRES, an explicit event benchmark.

pdf bib
ReadOnce Transformers: Reusable Representations of Text for Transformers
Shih-Ting Lin | Ashish Sabharwal | Tushar Khot
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We present ReadOnce Transformers, an approach to convert a transformer-based model into one that can build an information-capturing, task-independent, and compressed representation of text. The resulting representation is reusable across different examples and tasks, thereby requiring a document shared across many examples or tasks to only be read once. This leads to faster training and evaluation of models. Additionally, we extend standard text-to-text transformer models to Representation+Text-to-text models, and evaluate on multiple downstream tasks: multi-hop QA, abstractive QA, and long-document summarization. Our one-time computed representation results in a 2x-5x speedup compared to standard text-to-text models, while the compression also allows existing language models to handle longer documents without the need for designing new pre-trained models.

2020

pdf bib
Not All Claims are Created Equal: Choosing the Right Statistical Approach to Assess Hypotheses
Erfan Sadeqi Azer | Daniel Khashabi | Ashish Sabharwal | Dan Roth
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Empirical research in Natural Language Processing (NLP) has adopted a narrow set of principles for assessing hypotheses, relying mainly on p-value computation, which suffers from several known issues. While alternative proposals have been well-debated and adopted in other fields, they remain rarely discussed or used within the NLP community. We address this gap by contrasting various hypothesis assessment techniques, especially those not commonly used in the field (such as evaluations based on Bayesian inference). Since these statistical techniques differ in the hypotheses they can support, we argue that practitioners should first decide their target hypothesis before choosing an assessment method. This is crucial because common fallacies, misconceptions, and misinterpretation surrounding hypothesis assessment methods often stem from a discrepancy between what one would like to claim versus what the method used actually assesses. Our survey reveals that these issues are omnipresent in the NLP research community. As a step forward, we provide best practices and guidelines tailored to NLP research, as well as an easy-to-use package for Bayesian assessment of hypotheses, complementing existing tools.

pdf bib
More Bang for Your Buck: Natural Perturbation for Robust Question Answering
Daniel Khashabi | Tushar Khot | Ashish Sabharwal
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Deep learning models for linguistic tasks require large training datasets, which are expensive to create. As an alternative to the traditional approach of creating new instances by repeating the process of creating one instance, we propose doing so by first collecting a set of seed examples and then applying human-driven natural perturbations (as opposed to rule-based machine perturbations), which often change the gold label as well. Such perturbations have the advantage of being relatively easier (and hence cheaper) to create than writing out completely new examples. Further, they help address the issue that even models achieving human-level scores on NLP datasets are known to be considerably sensitive to small changes in input. To evaluate the idea, we consider a recent question-answering dataset (BOOLQ) and study our approach as a function of the perturbation cost ratio, the relative cost of perturbing an existing question vs. creating a new one from scratch. We find that when natural perturbations are moderately cheaper to create (cost ratio under 60%), it is more effective to use them for training BOOLQ models: such models exhibit 9% higher robustness and 4.5% stronger generalization, while retaining performance on the original BOOLQ dataset.

pdf bib
A Simple Yet Strong Pipeline for HotpotQA
Dirk Groeneveld | Tushar Khot | Mausam | Ashish Sabharwal
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

State-of-the-art models for multi-hop question answering typically augment large-scale language models like BERT with additional, intuitively useful capabilities such as named entity recognition, graph-based reasoning, and question decomposition. However, does their strong performance on popular multi-hop datasets really justify this added design complexity? Our results suggest that the answer may be no, because even our simple pipeline based on BERT, named , performs surprisingly well. Specifically, on HotpotQA, Quark outperforms these models on both question answering and support identification (and achieves performance very close to a RoBERTa model). Our pipeline has three steps: 1) use BERT to identify potentially relevant sentences independently of each other; 2) feed the set of selected sentences as context into a standard BERT span prediction model to choose an answer; and 3) use the sentence selection model, now with the chosen answer, to produce supporting sentences. The strong performance of Quark resurfaces the importance of carefully exploring simple model designs before using popular benchmarks to justify the value of complex techniques.

pdf bib
Is Multihop QA in DiRe Condition? Measuring and Reducing Disconnected Reasoning
Harsh Trivedi | Niranjan Balasubramanian | Tushar Khot | Ashish Sabharwal
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Has there been real progress in multi-hop question-answering? Models often exploit dataset artifacts to produce correct answers, without connecting information across multiple supporting facts. This limits our ability to measure true progress and defeats the purpose of building multi-hop QA datasets. We make three contributions towards addressing this. First, we formalize such undesirable behavior as disconnected reasoning across subsets of supporting facts. This allows developing a model-agnostic probe for measuring how much any model can cheat via disconnected reasoning. Second, using a notion of contrastive support sufficiency, we introduce an automatic transformation of existing datasets that reduces the amount of disconnected reasoning. Third, our experiments suggest that there hasn’t been much progress in multi-hop QA in the reading comprehension setting. For a recent large-scale model (XLNet), we show that only 18 points out of its answer F1 score of 72 on HotpotQA are obtained through multifact reasoning, roughly the same as that of a simpler RNN baseline. Our transformation substantially reduces disconnected reasoning (19 points in answer F1). It is complementary to adversarial approaches, yielding further reductions in conjunction.

pdf bib
UNIFIEDQA: Crossing Format Boundaries with a Single QA System
Daniel Khashabi | Sewon Min | Tushar Khot | Ashish Sabharwal | Oyvind Tafjord | Peter Clark | Hannaneh Hajishirzi
Findings of the Association for Computational Linguistics: EMNLP 2020

Question answering (QA) tasks have been posed using a variety of formats, such as extractive span selection, multiple choice, etc. This has led to format-specialized models, and even to an implicit division in the QA community. We argue that such boundaries are artificial and perhaps unnecessary, given the reasoning abilities we seek to teach are not governed by the format. As evidence, we use the latest advances in language modeling to build a single pre-trained QA model, UNIFIEDQA, that performs well across 19 QA datasets spanning 4 diverse formats. UNIFIEDQA performs on par with 8 different models that were trained on individual datasets themselves. Even when faced with 12 unseen datasets of observed formats, UNIFIEDQA performs surprisingly well, showing strong generalization from its outof-format training data. Finally, simply finetuning this pre trained QA model into specialized models results in a new state of the art on 10 factoid and commonsense question answering datasets, establishing UNIFIEDQA as a strong starting point for building QA systems.

pdf bib
UNQOVERing Stereotyping Biases via Underspecified Questions
Tao Li | Daniel Khashabi | Tushar Khot | Ashish Sabharwal | Vivek Srikumar
Findings of the Association for Computational Linguistics: EMNLP 2020

While language embeddings have been shown to have stereotyping biases, how these biases affect downstream question answering (QA) models remains unexplored. We present UNQOVER, a general framework to probe and quantify biases through underspecified questions. We show that a naive use of model scores can lead to incorrect bias estimates due to two forms of reasoning errors: positional dependence and question independence. We design a formalism that isolates the aforementioned errors. As case studies, we use this metric to analyze four important classes of stereotypes: gender, nationality, ethnicity, and religion. We probe five transformer-based QA models trained on two QA datasets, along with their underlying language models. Our broad study reveals that (1) all these models, with and without fine-tuning, have notable stereotyping biases in these classes; (2) larger models often have higher bias; and (3) the effect of fine-tuning on bias varies strongly with the dataset and the model size.

pdf bib
What Does My QA Model Know? Devising Controlled Probes Using Expert Knowledge
Kyle Richardson | Ashish Sabharwal
Transactions of the Association for Computational Linguistics, Volume 8

Open-domain question answering (QA) involves many knowledge and reasoning challenges, but are successful QA models actually learning such knowledge when trained on benchmark QA tasks? We investigate this via several new diagnostic tasks probing whether multiple-choice QA models know definitions and taxonomic reasoning—two skills widespread in existing benchmarks and fundamental to more complex reasoning. We introduce a methodology for automatically building probe datasets from expert knowledge sources, allowing for systematic control and a comprehensive evaluation. We include ways to carefully control for artifacts that may arise during this process. Our evaluation confirms that transformer-based multiple-choice QA models are already predisposed to recognize certain types of structural linguistic knowledge. However, it also reveals a more nuanced picture: their performance notably degrades even with a slight increase in the number of “hops” in the underlying taxonomic hierarchy, and with more challenging distractor candidates. Further, existing models are far from perfect when assessed at the level of clusters of semantically connected probes, such as all hypernym questions about a single concept.

2019

pdf bib
What’s Missing: A Knowledge Gap Guided Approach for Multi-hop Question Answering
Tushar Khot | Ashish Sabharwal | Peter Clark
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Multi-hop textual question answering requires combining information from multiple sentences. We focus on a natural setting where, unlike typical reading comprehension, only partial information is provided with each question. The model must retrieve and use additional knowledge to correctly answer the question. To tackle this challenge, we develop a novel approach that explicitly identifies the knowledge gap between a key span in the provided knowledge and the answer choices. The model, GapQA, learns to fill this gap by determining the relationship between the span and an answer choice, based on retrieved knowledge targeting this gap. We propose jointly training a model to simultaneously fill this knowledge gap and compose it with the provided partial knowledge. On the OpenBookQA dataset, given partial knowledge, explicitly identifying what’s missing substantially outperforms previous approaches.

pdf bib
Exploiting Explicit Paths for Multi-hop Reading Comprehension
Souvik Kundu | Tushar Khot | Ashish Sabharwal | Peter Clark
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We propose a novel, path-based reasoning approach for the multi-hop reading comprehension task where a system needs to combine facts from multiple passages to answer a question. Although inspired by multi-hop reasoning over knowledge graphs, our proposed approach operates directly over unstructured text. It generates potential paths through passages and scores them without any direct path supervision. The proposed model, named PathNet, attempts to extract implicit relations from text through entity pair representations, and compose them to encode each path. To capture additional context, PathNet also composes the passage representations along each path to compute a passage-based representation. Unlike previous approaches, our model is then able to explain its reasoning via these explicit paths through the passages. We show that our approach outperforms prior models on the multi-hop Wikihop dataset, and also can be generalized to apply to the OpenBookQA dataset, matching state-of-the-art performance.

pdf bib
Repurposing Entailment for Multi-Hop Question Answering Tasks
Harsh Trivedi | Heeyoung Kwon | Tushar Khot | Ashish Sabharwal | Niranjan Balasubramanian
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Question Answering (QA) naturally reduces to an entailment problem, namely, verifying whether some text entails the answer to a question. However, for multi-hop QA tasks, which require reasoning with multiple sentences, it remains unclear how best to utilize entailment models pre-trained on large scale datasets such as SNLI, which are based on sentence pairs. We introduce Multee, a general architecture that can effectively use entailment models for multi-hop QA tasks. Multee uses (i) a local module that helps locate important sentences, thereby avoiding distracting information, and (ii) a global module that aggregates information by effectively incorporating importance weights. Importantly, we show that both modules can use entailment functions pre-trained on a large scale NLI datasets. We evaluate performance on MultiRC and OpenBookQA, two multihop QA datasets. When using an entailment function pre-trained on NLI datasets, Multee outperforms QA models trained only on the target QA datasets and the OpenAI transformer models.

2018

pdf bib
AdvEntuRe: Adversarial Training for Textual Entailment with Knowledge-Guided Examples
Dongyeop Kang | Tushar Khot | Ashish Sabharwal | Eduard Hovy
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We consider the problem of learning textual entailment models with limited supervision (5K-10K training examples), and present two complementary approaches for it. First, we propose knowledge-guided adversarial example generators for incorporating large lexical resources in entailment models via only a handful of rule templates. Second, to make the entailment model—a discriminator—more robust, we propose the first GAN-style approach for training it using a natural language example generator that iteratively adjusts to the discriminator’s weaknesses. We demonstrate effectiveness using two entailment datasets, where the proposed methods increase accuracy by 4.7% on SciTail and by 2.8% on a 1% sub-sample of SNLI. Notably, even a single hand-written rule, negate, improves the accuracy of negation examples in SNLI by 6.1%.

pdf bib
Knowledge Completion for Generics using Guided Tensor Factorization
Hanie Sedghi | Ashish Sabharwal
Transactions of the Association for Computational Linguistics, Volume 6

Given a knowledge base or KB containing (noisy) facts about common nouns or generics, such as “all trees produce oxygen” or “some animals live in forests”, we consider the problem of inferring additional such facts at a precision similar to that of the starting KB. Such KBs capture general knowledge about the world, and are crucial for various applications such as question answering. Different from commonly studied named entity KBs such as Freebase, generics KBs involve quantification, have more complex underlying regularities, tend to be more incomplete, and violate the commonly used locally closed world assumption (LCWA). We show that existing KB completion methods struggle with this new task, and present the first approach that is successful. Our results demonstrate that external information, such as relation schemas and entity taxonomies, if used appropriately, can be a surprisingly powerful tool in this setting. First, our simple yet effective knowledge guided tensor factorization approach achieves state-of-the-art results on two generics KBs (80% precise) for science, doubling their size at 74%–86% precision. Second, our novel taxonomy guided, submodular, active learning method for collecting annotations about rare entities (e.g., oriole, a bird) is 6x more effective at inferring further new facts about them than multiple active learning baselines.

pdf bib
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
Todor Mihaylov | Peter Clark | Tushar Khot | Ashish Sabharwal
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We present a new kind of question answering dataset, OpenBookQA, modeled after open book exams for assessing human understanding of a subject. The open book that comes with our questions is a set of 1326 elementary level science facts. Roughly 6000 questions probe an understanding of these facts and their application to novel situations. This requires combining an open book fact (e.g., metals conduct electricity) with broad common knowledge (e.g., a suit of armor is made of metal) obtained from other sources. While existing QA datasets over documents or knowledge bases, being generally self-contained, focus on linguistic understanding, OpenBookQA probes a deeper understanding of both the topic—in the context of common knowledge—and the language it is expressed in. Human performance on OpenBookQA is close to 92%, but many state-of-the-art pre-trained QA methods perform surprisingly poorly, worse than several simple neural baselines we develop. Our oracle experiments designed to circumvent the knowledge retrieval bottleneck demonstrate the value of both the open book and additional facts. We leave it as a challenge to solve the retrieval problem in this multi-hop setting and to close the large gap to human performance.

pdf bib
Bridging Knowledge Gaps in Neural Entailment via Symbolic Models
Dongyeop Kang | Tushar Khot | Ashish Sabharwal | Peter Clark
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Most textual entailment models focus on lexical gaps between the premise text and the hypothesis, but rarely on knowledge gaps. We focus on filling these knowledge gaps in the Science Entailment task, by leveraging an external structured knowledge base (KB) of science facts. Our new architecture combines standard neural entailment models with a knowledge lookup module. To facilitate this lookup, we propose a fact-level decomposition of the hypothesis, and verifying the resulting sub-facts against both the textual premise and the structured KB. Our model, NSNet, learns to aggregate predictions from these heterogeneous data formats. On the SciTail dataset, NSNet outperforms a simpler combination of the two predictions by 3% and the base entailment model by 5%.

2017

pdf bib
Learning What is Essential in Questions
Daniel Khashabi | Tushar Khot | Ashish Sabharwal | Dan Roth
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)

Question answering (QA) systems are easily distracted by irrelevant or redundant words in questions, especially when faced with long or multi-sentence questions in difficult domains. This paper introduces and studies the notion of essential question terms with the goal of improving such QA solvers. We illustrate the importance of essential question terms by showing that humans’ ability to answer questions drops significantly when essential terms are eliminated from questions.We then develop a classifier that reliably (90% mean average precision) identifies and ranks essential terms in questions. Finally, we use the classifier to demonstrate that the notion of question term essentiality allows state-of-the-art QA solver for elementary-level science questions to make better and more informed decisions,improving performance by up to 5%.We also introduce a new dataset of over 2,200 crowd-sourced essential terms annotated science questions.

pdf bib
Answering Complex Questions Using Open Information Extraction
Tushar Khot | Ashish Sabharwal | Peter Clark
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

While there has been substantial progress in factoid question-answering (QA), answering complex questions remains challenging, typically requiring both a large body of knowledge and inference techniques. Open Information Extraction (Open IE) provides a way to generate semi-structured knowledge for QA, but to date such knowledge has only been used to answer simple questions with retrieval-based methods. We overcome this limitation by presenting a method for reasoning with Open IE knowledge, allowing more complex questions to be handled. Using a recently proposed support graph optimization framework for QA, we develop a new inference model for Open IE, in particular one that can work effectively with multiple short facts, noise, and the relational structure of tuples. Our model significantly outperforms a state-of-the-art structured solver on complex questions of varying difficulty, while also removing the reliance on manually curated knowledge.

2015

pdf bib
Exploring Markov Logic Networks for Question Answering
Tushar Khot | Niranjan Balasubramanian | Eric Gribkoff | Ashish Sabharwal | Peter Clark | Oren Etzioni
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Parsing Algebraic Word Problems into Equations
Rik Koncel-Kedziorski | Hannaneh Hajishirzi | Ashish Sabharwal | Oren Etzioni | Siena Dumas Ang
Transactions of the Association for Computational Linguistics, Volume 3

This paper formalizes the problem of solving multi-sentence algebraic word problems as that of generating and scoring equation trees. We use integer linear programming to generate equation trees and score their likelihood by learning local and global discriminative models. These models are trained on a small set of word problems and their answers, without any manual annotation, in order to choose the equation that best matches the problem text. We refer to the overall system as Alges. We compare Alges with previous work and show that it covers the full gamut of arithmetic operations whereas Hosseini et al. (2014) only handle addition and subtraction. In addition, Alges overcomes the brittleness of the Kushman et al. (2014) approach on single-equation problems, yielding a 15% to 50% reduction in error.