Peter Jansen - ACL Anthology

Peter Jansen

Also published as: Peter J. Jansen

2025

CodeScientist: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation
Peter Jansen | Oyvind Tafjord | Marissa Radensky | Pao Siangliulue | Tom Hope | Bhavana Dalvi Mishra | Bodhisattwa Prasad Majumder | Daniel S Weld | Peter Clark
Findings of the Association for Computational Linguistics: ACL 2025

Despite the surge of interest in autonomous scientific discovery (ASD) of software artifacts (e.g., improved ML algorithms), current ASD systems face two key limitations: (1) they largely explore variants of existing codebases or similarly constrained design spaces, and (2) they produce large volumes of research artifacts (such as automatically generated papers and code) that are typically evaluated using conference-style paper review with limited evaluation of code. In this work we introduce CodeScientist, a novel ASD system that frames ideation and experiment construction as a form of genetic search jointly over combinations of research articles and codeblocks defining common actions in a domain (like prompting a language model). We use this paradigm to conduct hundreds of automated experiments on machine-generated ideas broadly in the domain of agents and virtual environments, with the system returning 19 discoveries, 6 of which were judged as being both at least minimally sound and incrementally novel after a multi-faceted evaluation beyond that typically conducted in prior work, including external (conference-style) review, code review, and replication attempts. Moreover, the discoveries span new tasks, agents, metrics, and data, suggesting a qualitative shift from benchmark optimization to broader discoveries.

Matter-of-Fact: A Benchmark for Verifying the Feasibility of Literature-Supported Claims in Materials Science
Peter Jansen | Samiah Hassan | Ruoyao Wang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Contemporary approaches to assisted scientific discovery use language models to automatically generate large numbers of potential hypotheses to test, while also automatically generating code-based experiments to test those hypotheses. While hypotheses can be comparatively inexpensive to generate, automated experiments can be costly, particularly when run at scale (i.e., thousands of experiments). Developing the capacity to filter hypotheses based on their feasibility would allow discovery systems to run at scale, while increasing their likelihood of making significant discoveries. In this work we introduce Matter-of-Fact, a challenge dataset for determining the feasibility of hypotheses framed as claims, while operationalizing feasibility assessment as a temporally-filtered claim verification task using backtesting. Matter-of-Fact includes 8.4k claims extracted from scientific articles spanning four high-impact contemporary materials science topics, including superconductors, semiconductors, batteries, and aerospace materials, while including qualitative and quantitative claims from theoretical, experimental, and code/simulation results. We show that strong baselines that include retrieval augmented generation over scientific literature and code generation fail to exceed 72% performance on this task (chance performance is 50%), while domain-expert verification suggests nearly all are solvable – highlighting both the difficulty of this task for current models, and the potential to accelerate scientific discovery by making near-term progress.

Proceedings of the 1st Workshop on AI and Scientific Discovery: Directions and Opportunities
Peter Jansen | Bhavana Dalvi Mishra | Harsh Trivedi | Bodhisattwa Prasad Majumder | Tom Hope | Tushar Khot | Doug Downey | Eric Horvitz
Proceedings of the 1st Workshop on AI and Scientific Discovery: Directions and Opportunities

2024

PDDLEGO: Iterative Planning in Textual Environments
Li Zhang | Peter Jansen | Tianyi Zhang | Peter Clark | Chris Callison-Burch | Niket Tandon
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)

Planning in textual environments has been shown to be a long-standing challenge even for current models. A recent, promising line of work uses LLMs to generate a formal representation of the environment that can be solved by a symbolic planner. However, existing methods rely on a fully-observed environment where all entity states are initially known, so a one-off representation can be constructed, leading to a complete plan. In contrast, we tackle partially-observed environments where there is initially insufficient information to plan for the end-goal. We propose PDDLEGO, which iteratively constructs a planning representation that can lead to a partial plan for a given sub-goal. By accomplishing the sub-goal, more information is acquired to augment the representation, eventually achieving the end-goal. We show that plans produced by few-shot PDDLEGO are 43% more efficient than generating plans end-to-end on the Coin Collector simulation, with strong performance (98%) on the more complex Cooking World simulation where end-to-end LLMs fail to generate coherent plans (4%).

Proceedings of the 2nd Workshop on Natural Language Reasoning and Structured Explanations (@ACL 2024)
Bhavana Dalvi Mishra | Greg Durrett | Peter Jansen | Ben Lipkin | Danilo Neves Ribeiro | Lionel Wong | Xi Ye | Wenting Zhao
Proceedings of the 2nd Workshop on Natural Language Reasoning and Structured Explanations (@ACL 2024)

Enhancing Systematic Decompositional Natural Language Inference Using Informal Logic
Nathaniel Weir | Kate Sanders | Orion Weller | Shreya Sharma | Dongwei Jiang | Zhengping Jiang | Bhavana Dalvi Mishra | Oyvind Tafjord | Peter Jansen | Peter Clark | Benjamin Van Durme
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Recent language models enable new opportunities for structured reasoning with text, such as the construction of intuitive, proof-like textual entailment trees without relying on brittle formal logic. However, progress in this direction has been hampered by a long-standing lack of a clear protocol for determining what _valid decompositional entailment_ is. This absence causes noisy datasets and limited performance gains by modern neuro-symbolic entailment engines. To address these problems, we formulate a consistent and theoretically grounded approach to annotating decompositional entailment and evaluate its impact on LLM-based textual inference. We find that our new dataset, RDTE (Recognizing Decompositional Textual Entailment), has a substantially higher internal consistency than prior decompositional entailment datasets, suggesting that RDTE is a significant step forward in the long-standing problem of forming a clear protocol for discerning entailment. We also find that training an RDTE-oriented entailment classifier via knowledge distillation and employing it in an entailment tree reasoning engine significantly improves both accuracy and proof quality, illustrating the practical benefit of this advance for textual inference.

Can Language Models Serve as Text-Based World Simulators?
Ruoyao Wang | Graham Todd | Ziang Xiao | Xingdi Yuan | Marc-Alexandre Côté | Peter Clark | Peter Jansen
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Virtual environments play a key role in benchmarking advances in complex planning and decision-making tasks but are expensive and complicated to build by hand. Can current language models themselves serve as world simulators, correctly predicting how actions change different world states, thus bypassing the need for extensive manual coding? Our goal is to answer this question in the context of text-based simulators. Our approach is to build and use a new benchmark, called ByteSized32-State-Prediction, containing a dataset of text game state transitions and accompanying game tasks. We use this to directly quantify, for the first time, how well LLMs can serve as text-based world simulators. We test GPT-4 on this dataset and find that, despite its impressive performance, it is still an unreliable world simulator without further innovations. This work thus contributes both new insights into current LLMs’ capabilities and weaknesses and a novel benchmark to track future progress as new models appear.

2023

Proceedings of the 1st Workshop on Natural Language Reasoning and Structured Explanations (NLRSE)
Bhavana Dalvi Mishra | Greg Durrett | Peter Jansen | Danilo Neves Ribeiro | Jason Wei
Proceedings of the 1st Workshop on Natural Language Reasoning and Structured Explanations (NLRSE)

From Words to Wires: Generating Functioning Electronic Devices from Natural Language Descriptions
Peter Jansen
Findings of the Association for Computational Linguistics: EMNLP 2023

In this work, we show that contemporary language models have a previously unknown skill – the capacity for electronic circuit design from high-level textual descriptions, akin to code generation. We introduce two benchmarks: PINS100, assessing model knowledge of electrical components, and MICRO25, evaluating a model’s capability to design common microcontroller circuits and code in the Arduino ecosystem that involve input, output, sensors, motors, protocols, and logic – with models such as GPT-4 and Claude-V1 achieving between 60% and 96% Pass@1 on generating full devices. We include six case studies of using language models as a design assistant for moderately complex devices, such as a radiation-powered random number generator, an emoji keyboard, a visible spectrometer, and several assistive devices, while offering a qualitative analysis of performance, outlining evaluation challenges, and suggesting areas of development to improve complex circuit design and practical utility. With this work, we aim to spur research at the juncture of natural language processing and electronic design.

Self-Supervised Behavior Cloned Transformers are Path Crawlers for Text Games
Ruoyao Wang | Peter Jansen
Findings of the Association for Computational Linguistics: EMNLP 2023

In this work, we introduce a self-supervised behavior cloning transformer for text games, which are challenging benchmarks for multi-step reasoning in virtual environments. Traditionally, Behavior Cloning Transformers excel in such tasks but rely on supervised training data. Our approach auto-generates training data by exploring trajectories (defined by common macro-action sequences) that lead to reward within the games, while determining the generality and utility of these trajectories by rapidly training small models then evaluating their performance on unseen development games. Through empirical analysis, we show our method consistently uncovers generalizable training data, achieving about 90% performance of supervised systems across three benchmark text games.

ByteSized32: A Corpus and Challenge Task for Generating Task-Specific World Models Expressed as Text Games
Ruoyao Wang | Graham Todd | Xingdi Yuan | Ziang Xiao | Marc-Alexandre Côté | Peter Jansen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

In this work we investigate the capacity of language models to generate explicit, interpretable, and interactive world models of scientific and common-sense reasoning tasks. We operationalize this as a task of generating text games, expressed as hundreds of lines of Python code. To facilitate this task, we introduce ByteSized32, a corpus of 32 reasoning-focused text games totalling 20k lines of Python code. We empirically demonstrate that GPT-4 can use these games as templates for single-shot in-context learning, successfully producing runnable games on unseen topics in 28% of cases. When allowed to self-reflect on program errors, game runnability substantially increases to 58%. While evaluating simulation fidelity is labor intensive, we introduce a suite of automated metrics to assess game fidelity, technical validity, adherence to task specifications, and winnability, showing a high degree of agreement with expert human ratings. We pose this as a challenge task to spur further development at the juncture of world modeling and code generation.

Behavior Cloned Transformers are Neurosymbolic Reasoners
Ruoyao Wang | Peter Jansen | Marc-Alexandre Côté | Prithviraj Ammanabrolu
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

In this work, we explore techniques for augmenting interactive agents with information from symbolic modules, much like humans use tools like calculators and GPS systems to assist with arithmetic and navigation. We test our agent’s abilities in text games – challenging benchmarks for evaluating the multi-step reasoning abilities of game agents in grounded, language-based environments. Our experimental study indicates that injecting the actions from these symbolic modules into the action space of a behavior cloned transformer agent increases performance on four text game benchmarks that test arithmetic, navigation, sorting, and common sense reasoning by an average of 22%, allowing an agent to reach the highest possible performance on unseen games. This action injection technique is easily extended to new agents, environments, and symbolic modules.

TextWorldExpress: Simulating Text Games at One Million Steps Per Second
Peter Jansen | Marc-Alexandre Côté
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

Text-based games offer a challenging test bed to evaluate virtual agents at language understanding, multi-step problem-solving, and common-sense reasoning. However, speed is a major limitation of current text-based games, capping at 300 steps per second, mainly due to the use of legacy tooling. In this work we present TextWorldExpress, a high-performance simulator that includes implementations of three common text game benchmarks and increases simulation throughput by approximately three orders of magnitude, reaching over one million steps per second on common desktop hardware. This significantly reduces experiment runtime, enabling billion-step-scale experiments in about one day.

2022

A Systematic Survey of Text Worlds as Embodied Natural Language Environments
Peter Jansen
Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022)

Text Worlds are virtual environments for embodied agents that, unlike 2D or 3D environments, are rendered exclusively using textual descriptions. These environments offer an alternative to higher-fidelity 3D environments due to their low barrier to entry, providing the ability to study semantics, compositional inference, and other high-level tasks with rich action spaces while controlling for perceptual input. This systematic survey outlines recent developments in tooling, environments, and agent modeling for Text Worlds, while examining recent trends in knowledge graphs, common sense reasoning, transfer learning of Text World performance to higher-fidelity environments, as well as near-term development targets that, once achieved, make Text Worlds an attractive general research paradigm for natural language processing.

Extracting Space Situational Awareness Events from News Text
Zhengnan Xie | Alice Saebom Kwak | Enfa George | Laura W. Dozal | Hoang Van | Moriba Jah | Roberto Furfaro | Peter Jansen
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Space situational awareness typically makes use of physical measurements from radar, telescopes, and other assets to monitor satellites and other spacecraft for operational, navigational, and defense purposes. In this work we explore using textual input for the space situational awareness task. We construct a corpus of 48.5k news articles spanning all known active satellites between 2009 and 2020. Using a dependency-rule-based extraction system designed to target three high-impact events – spacecraft launches, failures, and decommissionings, we identify 1,787 space-event sentences that are then annotated by humans with 15.9k labels for event slots. We empirically demonstrate a state-of-the-art neural extraction system achieves an overall F1 between 53 and 91 per slot for event extraction in this low-resource, high-impact domain.

ScienceWorld: Is your Agent Smarter than a 5th Grader?
Ruoyao Wang | Peter Jansen | Marc-Alexandre Côté | Prithviraj Ammanabrolu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

We present ScienceWorld, a benchmark to test agents’ scientific reasoning abilities in a new interactive text environment at the level of a standard elementary school science curriculum. Despite the transformer-based progress seen in question-answering and scientific text processing, we find that current models cannot reason about or explain learned science concepts in novel contexts. For instance, models can easily answer what the conductivity of a known material is but struggle when asked how they would conduct an experiment in a grounded environment to find the conductivity of an unknown material. This raises the question of whether current models are simply retrieving answers by way of seeing a large number of similar examples or if they have learned to reason about concepts in a reusable manner. We hypothesize that agents need to be grounded in interactive environments to achieve such reasoning capabilities. Our experiments provide empirical evidence supporting this hypothesis – showing that a 1.5 million parameter agent trained interactively for 100k steps outperforms an 11 billion parameter model statically trained for scientific question-answering and reasoning from millions of expert demonstrations.

2021

TextGraphs 2021 Shared Task on Multi-Hop Inference for Explanation Regeneration
Mokanarangan Thayaparan | Marco Valentino | Peter Jansen | Dmitry Ustalov
Proceedings of the Fifteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-15)

The Shared Task on Multi-Hop Inference for Explanation Regeneration asks participants to compose large multi-hop explanations to questions by assembling large chains of facts from a supporting knowledge base. While previous editions of this shared task aimed to evaluate explanatory completeness – finding a set of facts that form a complete inference chain, without gaps, to arrive from question to correct answer, this 2021 instantiation concentrates on the subtask of determining relevance in large multi-hop explanations. To this end, this edition of the shared task makes use of a large set of approximately 250k manual explanatory relevancy ratings that augment the 2020 shared task data. In this summary paper, we describe the details of the explanation regeneration task, the evaluation data, and the participating systems. Additionally, we perform a detailed analysis of participating systems, evaluating various aspects involved in the multi-hop inference process. The best performing system achieved an NDCG of 0.82 on this challenging task, substantially increasing performance over baseline methods by 32%, while also leaving significant room for future improvement.
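
The NDCG figure reported above rewards rankings that place highly rated facts near the top. The following is a minimal sketch of that metric, assuming graded relevance ratings and a linear gain with a log2 rank discount; the abstract does not specify the exact variant used by the shared task, and the function names here are illustrative only.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain: each rating is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(ranked_relevances: Sequence[float],
         all_relevances: Sequence[float]) -> float:
    """NDCG of a ranking, normalized by the best possible ordering.

    ranked_relevances: ratings of the facts in the order the system ranked them.
    all_relevances: ratings of every rated fact for the question (any order).
    Assumption: linear gain; other variants use 2**rel - 1 instead.
    """
    ideal = dcg(sorted(all_relevances, reverse=True)[:len(ranked_relevances)])
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    ratings = [3, 2, 1]          # hypothetical expert ratings for three facts
    print(ndcg([3, 2, 1], ratings))  # ideal order
    print(ndcg([1, 2, 3], ratings))  # reversed order scores lower
```

A ranking that matches the ideal ordering scores 1.0, and any ordering that pushes highly rated facts down the list scores lower.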

Proceedings of the Fifteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-15)
Alexander Panchenko | Fragkiskos D. Malliaros | Varvara Logacheva | Abhik Jana | Dmitry Ustalov | Peter Jansen
Proceedings of the Fifteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-15)

On the Challenges of Evaluating Compositional Explanations in Multi-Hop Inference: Relevance, Completeness, and Expert Ratings
Peter Jansen | Kelly J. Smith | Dan Moreno | Huitzilin Ortiz
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Building compositional explanations requires models to combine two or more facts that, together, describe why the answer to a question is correct. Typically, these “multi-hop” explanations are evaluated relative to one (or a small number of) gold explanations. In this work, we show these evaluations substantially underestimate model performance, both in terms of the relevance of included facts, as well as the completeness of model-generated explanations, because models regularly discover and produce valid explanations that are different than gold explanations. To address this, we construct a large corpus of 126k domain-expert (science teacher) relevance ratings that augment a corpus of explanations to standardized science exam questions, discovering 80k additional relevant facts not rated as gold. We build three strong models based on different methodologies (generation, ranking, and schemas), and empirically show that while expert-augmented ratings provide better estimates of explanation quality, both original (gold) and expert-augmented automatic evaluations still substantially underestimate performance by up to 36% when compared with full manual expert judgements, with different models being disproportionately affected. This poses a significant methodological challenge to accurately evaluating explanations produced by compositional reasoning models.

Explaining Answers with Entailment Trees
Bhavana Dalvi | Peter Jansen | Oyvind Tafjord | Zhengnan Xie | Hannah Smith | Leighanna Pipatanangkura | Peter Clark
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Our goal, in the context of open-domain textual question-answering (QA), is to explain answers by showing the line of reasoning from what is known to the answer, rather than simply showing a fragment of textual evidence (a “rationale”). If this could be done, new opportunities for understanding and debugging the system’s reasoning become possible. Our approach is to generate explanations in the form of entailment trees, namely a tree of multipremise entailment steps from facts that are known, through intermediate conclusions, to the hypothesis of interest (namely the question + answer). To train a model with this skill, we created ENTAILMENTBANK, the first dataset to contain multistep entailment trees. Given a hypothesis (question + answer), we define three increasingly difficult explanation tasks: generate a valid entailment tree given (a) all relevant sentences, (b) all relevant and some irrelevant sentences, or (c) a corpus. We show that a strong language model can partially solve these tasks, in particular when the relevant sentences are included in the input (e.g., 35% of trees for (a) are perfect), and with indications of generalization to other domains. This work is significant as it provides a new type of dataset (multistep entailments) and baselines, offering a new avenue for the community to generate richer, more systematic explanations.

2020

TextGraphs 2020 Shared Task on Multi-Hop Inference for Explanation Regeneration
Peter Jansen | Dmitry Ustalov
Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs)

The 2020 Shared Task on Multi-Hop Inference for Explanation Regeneration tasks participants with regenerating large detailed multi-fact explanations for standardized science exam questions. Given a question, correct answer, and knowledge base, models must rank each fact in the knowledge base such that facts most likely to appear in the explanation are ranked highest. Explanations consist of an average of 6 (and as many as 16) facts that span both core scientific knowledge and world knowledge, and form an explicit lexically-connected “explanation graph” describing how the facts interrelate. In this second iteration of the explanation regeneration shared task, participants are supplied with more than double the training and evaluation data of the first shared task, as well as a knowledge base nearly double in size, both of which expand into more challenging scientific topics that increase the difficulty of the task. In total 10 teams participated, and 5 teams submitted system description papers. The best-performing teams significantly increased state-of-the-art performance both in terms of ranking (mean average precision) and inference speed on this challenge task.
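
The ranking evaluation described above (facts from the gold explanation should be ranked highest, scored with mean average precision) can be sketched as follows. This is an illustrative sketch only, assuming binary gold labels per fact; the fact identifiers and function names are hypothetical and are not taken from the shared task's scoring scripts.

```python
from typing import Sequence, Set


def average_precision(ranked_facts: Sequence[str], gold_facts: Set[str]) -> float:
    """Average precision of one ranked list of fact IDs against a gold explanation."""
    if not gold_facts:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, fact in enumerate(ranked_facts, start=1):
        if fact in gold_facts:
            hits += 1
            precision_sum += hits / rank  # precision at this gold fact's rank
    # Gold facts that were never ranked contribute zero to the sum.
    return precision_sum / len(gold_facts)


def mean_average_precision(rankings: Sequence[Sequence[str]],
                           golds: Sequence[Set[str]]) -> float:
    """Mean of per-question average precision."""
    scores = [average_precision(r, g) for r, g in zip(rankings, golds)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical question with two gold facts, ranked 1st and 3rd.
    ranking = ["f1", "f2", "f3", "f4"]
    gold = {"f1", "f3"}
    print(average_precision(ranking, gold))  # (1/1 + 2/3) / 2
```

Under this definition a gold fact ranked lower in the list contributes a smaller precision term, so systems are rewarded for placing all explanation facts ahead of irrelevant knowledge base rows.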

Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs)
Dmitry Ustalov | Swapna Somasundaran | Alexander Panchenko | Fragkiskos D. Malliaros | Ioana Hulpuș | Peter Jansen | Abhik Jana
Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs)

WorldTree V2: A Corpus of Science-Domain Structured Explanations and Inference Patterns supporting Multi-Hop Inference
Zhengnan Xie | Sebastian Thiem | Jaycie Martin | Elizabeth Wainwright | Steven Marmorstein | Peter Jansen
Proceedings of the Twelfth Language Resources and Evaluation Conference

Explainable question answering for complex questions often requires combining large numbers of facts to answer a question while providing a human-readable explanation for the answer, a process known as multi-hop inference. Standardized science questions require combining an average of 6 facts, and as many as 16 facts, in order to answer and explain, but most existing datasets for multi-hop reasoning focus on combining only two facts, significantly limiting the ability of multi-hop inference algorithms to learn to generate large inferences. In this work we present the second iteration of the WorldTree project, a corpus of 5,114 standardized science exam questions paired with large detailed multi-fact explanations that combine core scientific knowledge and world knowledge. Each explanation is represented as a lexically-connected “explanation graph” that combines an average of 6 facts drawn from a semi-structured knowledge base of 9,216 facts across 66 tables. We use this explanation corpus to author a set of 344 high-level science domain inference patterns similar to semantic frames supporting multi-hop inference. Together, these resources provide training data and instrumentation for developing many-fact multi-hop inference models for question answering.

Multi-class Hierarchical Question Classification for Multiple Choice Science Exams
Dongfang Xu | Peter Jansen | Jaycie Martin | Zhengnan Xie | Vikas Yadav | Harish Tayyar Madabushi | Oyvind Tafjord | Peter Clark
Proceedings of the Twelfth Language Resources and Evaluation Conference

Prior work has demonstrated that question classification (QC), recognizing the problem domain of a question, can help answer it more accurately. However, developing strong QC algorithms has been hindered by the limited size and complexity of annotated data available. To address this, we present the largest challenge dataset for QC, containing 7,787 science exam questions paired with detailed classification labels from a fine-grained hierarchical taxonomy of 406 problem domains. We then show that a BERT-based model trained on this dataset achieves a large (+0.12 MAP) gain compared with previous methods, while also achieving state-of-the-art performance on benchmark open-domain and biomedical QC datasets. Finally, we show that using this model’s predictions of question topic significantly improves the accuracy of a question answering system by +1.7% P@1, with substantial future gains possible as QC performance improves.

ScienceExamCER: A High-Density Fine-Grained Science-Domain Corpus for Common Entity Recognition
Hannah Smith | Zeyu Zhang | John Culnan | Peter Jansen
Proceedings of the Twelfth Language Resources and Evaluation Conference

Named entity recognition identifies common classes of entities in text, but these entity labels are generally sparse, limiting utility to downstream tasks. In this work we present ScienceExamCER, a densely-labeled semantic classification corpus of 133k mentions in the science exam domain where nearly all (96%) of content words have been annotated with one or more fine-grained semantic class labels including taxonomic groups, meronym groups, verb/action groups, properties and values, and synonyms. Semantic class labels are drawn from a manually-constructed fine-grained typology of 601 classes generated through a data-driven analysis of 4,239 science exam questions. We show an off-the-shelf BERT-based named entity recognition model modified for multi-label classification achieves an accuracy of 0.85 F1 on this task, suggesting strong utility for downstream tasks in science domain question answering requiring densely-labeled semantic classification.

Visually-Grounded Planning without Vision: Language Models Infer Detailed Plans from High-level Instructions
Peter Jansen
Findings of the Association for Computational Linguistics: EMNLP 2020

The recently proposed ALFRED challenge task aims for a virtual robotic agent to complete complex multi-step everyday tasks in a virtual home environment from high-level natural language directives, such as “put a hot piece of bread on a plate”. Currently, the best-performing models are able to complete less than 1% of these tasks successfully. In this work we focus on modeling the translation problem of converting natural language directives into detailed multi-step sequences of actions that accomplish those goals in the virtual environment. We empirically demonstrate that it is possible to generate gold multi-step plans from language directives alone without any visual input in 26% of unseen cases. When a small amount of visual information, the starting location in the virtual environment, is incorporated, our best-performing GPT-2 model successfully generates gold command sequences in 58% of cases, suggesting contextualized language models may provide strong planning modules for grounded virtual agents.

CoSaTa: A Constraint Satisfaction Solver and Interpreted Language for Semi-Structured Tables of Sentences
Peter Jansen
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

This work presents CoSaTa, an intuitive constraint satisfaction solver and interpreted language for knowledge bases of semi-structured tables expressed as text. The stand-alone CoSaTa solver allows easily expressing complex compositional “inference patterns” for how knowledge from different tables tends to connect to support inference and explanation construction in question answering and other downstream tasks, while including advanced declarative features and the ability to operate over multiple representations of text (words, lemmas, or part-of-speech tags). CoSaTa also includes a hybrid imperative/declarative interpreted language for expressing simple models through minimally-specified simulations grounded in constraint patterns, helping bridge the gap between question answering, question explanation, and model simulation. The solver and interpreter are released as open source. Screencast Demo: https://youtu.be/t93Acsz7LyE

2019

Extracting Common Inference Patterns from Semi-Structured Explanations
Sebastian Thiem | Peter Jansen
Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing

Complex questions often require combining multiple facts to correctly answer, particularly when generating detailed explanations for why those answers are correct. Combining multiple facts to answer questions is often modeled as a “multi-hop” graph traversal problem, where a given solver must find a series of interconnected facts in a knowledge graph that, taken together, answer the question and explain the reasoning behind that answer. Multi-hop inference currently suffers from semantic drift, or the tendency for chains of reasoning to “drift” to unrelated topics, and this semantic drift greatly limits the number of facts that can be combined in both free text and knowledge base inference. In this work we present our effort to mitigate semantic drift by extracting large high-confidence multi-hop inference patterns, generated by abstracting large-scale explanatory structure from a corpus of detailed explanations. We represent these inference patterns as sets of generalized constraints over sentences represented as rows in a knowledge base of semi-structured tables. We present a prototype tool for identifying common inference patterns from corpora of semi-structured explanations, and use it to successfully extract 67 inference patterns from a “matter” subset of standardized elementary science exam questions that span scientific and world knowledge.

TextGraphs 2019 Shared Task on Multi-Hop Inference for Explanation Regeneration
Peter Jansen | Dmitry Ustalov
Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)

While automated question answering systems are increasingly able to retrieve answers to natural language questions, their ability to generate detailed human-readable explanations for their answers is still quite limited. The Shared Task on Multi-Hop Inference for Explanation Regeneration tasks participants with regenerating detailed gold explanations for standardized elementary science exam questions by selecting facts from a knowledge base of semi-structured tables. Each explanation contains between 1 and 16 interconnected facts that form an “explanation graph” spanning core scientific knowledge and detailed world knowledge. It is expected that successfully combining these facts to generate detailed explanations will require advancing methods in multi-hop inference and information combination, and will make use of the supervised training data provided by the WorldTree explanation corpus. The top-performing system achieved a mean average precision (MAP) of 0.56, substantially advancing the state-of-the-art over a baseline information retrieval model. Detailed extended analyses of all submitted systems showed large relative improvements in accessing the most challenging multi-hop inference problems, while absolute performance remains low, highlighting the difficulty of generating detailed explanations through multi-hop reasoning.
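As an illustration of the mean average precision (MAP) metric mentioned above, the following is a minimal sketch of how MAP is conventionally computed over ranked lists of facts. It is not the shared task's official scorer; the function names and the toy fact identifiers are assumptions made for this example.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], gold: Set[str]) -> float:
    """Average precision of one ranked list of fact IDs against a gold set."""
    if not gold:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, fact_id in enumerate(ranked, start=1):
        if fact_id in gold:
            hits += 1
            # Precision at this rank, counted only where a gold fact appears.
            precision_sum += hits / rank
    # Gold facts that were never retrieved contribute zero.
    return precision_sum / len(gold)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of the per-question average precision values."""
    scores = [average_precision(ranked, gold) for ranked, gold in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: two questions, each with a ranked list and a gold explanation.
toy_runs = [
    (["f1", "f9", "f2"], {"f1", "f2"}),
    (["f7", "f3"], {"f3"}),
]
toy_map = mean_average_precision(toy_runs)
```

In this toy case the first question scores (1/1 + 2/3) / 2 and the second scores 1/2, so `toy_map` is their mean.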

Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)
Dmitry Ustalov | Swapna Somasundaran | Peter Jansen | Goran Glavaš | Martin Riedl | Mihai Surdeanu | Michalis Vazirgiannis
Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)

2018

Multi-hop Inference for Sentence-level TextGraphs: How Challenging is Meaningfully Combining Information for Science Question Answering?
Peter Jansen
Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12)

Question Answering for complex questions is often modelled as a graph construction or traversal task, where a solver must build or traverse a graph of facts that answer and explain a given question. This “multi-hop” inference has been shown to be extremely challenging, with few models able to aggregate more than two facts before being overwhelmed by “semantic drift”, or the tendency for long chains of facts to quickly drift off topic. This is a major barrier to current inference models, as even elementary science questions require an average of 4 to 6 facts to answer and explain. In this work we empirically characterize the difficulty of building or traversing a graph of sentences connected by lexical overlap, by evaluating chance sentence aggregation quality through 9,784 manually-annotated judgements across knowledge graphs built from three free-text corpora (including study guides and Simple Wikipedia). We demonstrate semantic drift tends to be high and aggregation quality low, at between 0.04% and 3%, and highlight scenarios that maximize the likelihood of meaningfully combining information.

WorldTree: A Corpus of Explanation Graphs for Elementary Science Questions supporting Multi-hop Inference
Peter Jansen | Elizabeth Wainwright | Steven Marmorstein | Clayton Morrison
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Tell Me Why: Using Question Answering as Distant Supervision for Answer Justification
Rebecca Sharp | Mihai Surdeanu | Peter Jansen | Marco A. Valenzuela-Escárcega | Peter Clark | Michael Hammond
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)

For many applications of question answering (QA), being able to explain why a given model chose an answer is critical. However, the lack of labeled data for answer justifications makes learning this difficult and expensive. Here we propose an approach that uses answer ranking as distant supervision for learning how to select informative justifications, where justifications serve as inferential connections between the question and the correct answer while often containing little lexical overlap with either. We propose a neural network architecture for QA that reranks answer justifications as an intermediate (and human-interpretable) step in answer selection. Our approach is informed by a set of features designed to combine both learned representations and explicit features to capture the connection between questions, answers, and answer justifications. We show that with this end-to-end approach we are able to significantly improve upon a strong IR baseline in both justification ranking (+9% rated highly relevant) and answer selection (+6% P@1).

Framing QA as Building and Ranking Intersentence Answer Justifications
Peter Jansen | Rebecca Sharp | Mihai Surdeanu | Peter Clark
Computational Linguistics, Volume 43, Issue 2 - June 2017

We propose a question answering (QA) approach for standardized science exams that both identifies correct answers and produces compelling human-readable justifications for why those answers are correct. Our method first identifies the actual information needed in a question using psycholinguistic concreteness norms, then uses this information need to construct answer justifications by aggregating multiple sentences from different knowledge bases using syntactic and lexical information. We then jointly rank answers and their justifications using a reranking perceptron that treats justification quality as a latent variable. We evaluate our method on 1,000 multiple-choice questions from elementary school science exams, and empirically demonstrate that it performs better than several strong baselines, including neural network approaches. Our best configuration answers 44% of the questions correctly, where the top justifications for 57% of these correct answers contain a compelling human-readable justification that explains the inference required to arrive at the correct answer. We include a detailed characterization of the justification quality for both our method and a strong baseline, and show that information aggregation is key to addressing the information need in complex questions.

2016

Creating Causal Embeddings for Question Answering with Minimal Supervision
Rebecca Sharp | Mihai Surdeanu | Peter Jansen | Peter Clark | Michael Hammond
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

What’s in an Explanation? Characterizing Knowledge and Inference Requirements for Elementary Science Exams
Peter Jansen | Niranjan Balasubramanian | Mihai Surdeanu | Peter Clark
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

QA systems have been making steady advances in the challenging elementary science exam domain. In this work, we develop an explanation-based analysis of knowledge and inference requirements, which supports a fine-grained characterization of the challenges. In particular, we model the requirements based on appropriate sources of evidence to be used for the QA task. We create requirements by first identifying suitable sentences in a knowledge base that support the correct answer, then use these to build explanations, filling in any necessary missing information. These explanations are used to create a fine-grained categorization of the requirements. Using these requirements, we compare a retrieval and an inference solver on 212 questions. The analysis validates the gains of the inference solver, demonstrating that it answers more questions requiring complex inference, while also providing insights into the relative strengths of the solvers and knowledge sources. We release the annotated questions and explanations as a resource with broad utility for science exam QA, including determining knowledge base construction targets, as well as supporting information aggregation in automated inference.

2015

Higher-order Lexical Semantic Models for Non-factoid Answer Reranking
Daniel Fried | Peter Jansen | Gustave Hahn-Powell | Mihai Surdeanu | Peter Clark
Transactions of the Association for Computational Linguistics, Volume 3

Lexical semantic models provide robust performance for question answering, but, in general, can only capitalize on direct evidence seen during training. For example, monolingual alignment models acquire term alignment probabilities from semi-structured data such as question-answer pairs; neural network language models learn term embeddings from unstructured text. All this knowledge is then used to estimate the semantic similarity between question and answer candidates. We introduce a higher-order formalism that allows all these lexical semantic models to chain direct evidence to construct indirect associations between question and answer texts, by casting the task as the traversal of graphs that encode direct term associations. Using a corpus of 10,000 questions from Yahoo! Answers, we experimentally demonstrate that higher-order methods are broadly applicable to alignment and language models, across both word and syntactic representations. We show that an important criterion for success is controlling for the semantic drift that accumulates during graph traversal. All in all, the proposed higher-order approach improves five out of the six lexical semantic models investigated, with relative gains of up to +13% over their first-order variants.
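The idea of chaining direct term associations into indirect ones can be sketched as a two-step graph traversal. This is only an illustrative toy, assuming a hypothetical nested-dictionary representation of association weights and a simple sum-of-products combination; it is not the formalism or the models used in the paper.

```python
from collections import defaultdict
from typing import Dict

# Hypothetical representation: source term -> {target term: association weight}.
Assoc = Dict[str, Dict[str, float]]


def chain(first: Assoc, second: Assoc) -> Assoc:
    """Compose two association graphs: follow an edge in `first`, then one in `second`.

    The weight of each indirect association is the sum, over intermediate
    terms, of the product of the two direct weights (an assumed scheme).
    """
    out: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for src, mids in first.items():
        for mid, w1 in mids.items():
            for dst, w2 in second.get(mid, {}).items():
                out[src][dst] += w1 * w2
    return {src: dict(dsts) for src, dsts in out.items()}


# Toy direct (first-order) associations; the terms and weights are made up.
direct: Assoc = {
    "rain": {"cloud": 0.5},
    "cloud": {"vapor": 0.4},
}

# Second-order associations link "rain" to "vapor" through "cloud".
second_order = chain(direct, direct)
```

Because each hop multiplies weights below one, longer chains yield weaker associations, which is one simple way to picture why the semantic drift accumulated during traversal has to be controlled.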

Spinning Straw into Gold: Using Free Text to Train Monolingual Alignment Models for Non-factoid Question Answering
Rebecca Sharp | Peter Jansen | Mihai Surdeanu | Peter Clark
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

Discourse Complements Lexical Semantics for Non-factoid Answer Reranking
Peter Jansen | Mihai Surdeanu | Peter Clark
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2005

Symmetric Probabilistic Alignment
Ralf D. Brown | Jae Dong Kim | Peter J. Jansen | Jaime G. Carbonell
Proceedings of the ACL Workshop on Building and Using Parallel Texts

Symmetric probabilistic alignment for example-based translation
Jae Dong Kim | Ralf D. Brown | Peter J. Jansen | Jaime G. Carbonell
Proceedings of the 10th EAMT Conference: Practical applications of machine translation

2004

Developing Language Resources for a Transnational Digital Government System
Violetta Cavalli-Sforza | Jaime G. Carbonell | Peter J. Jansen
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2003

Reducing boundary friction using translation-fragment overlap
Ralf D. Brown | Rebecca Hutchinson | Paul N. Bennett | Jaime G. Carbonell | Peter Jansen
Proceedings of Machine Translation Summit IX: Papers

Many corpus-based Machine Translation (MT) systems generate a number of partial translations which are then pieced together rather than immediately producing one overall translation. While this makes them more robust to ill-formed input, they are subject to disfluencies at phrasal translation boundaries even for well-formed input. We address this “boundary friction” problem by introducing a method that exploits overlapping phrasal translations and the increased confidence in translation accuracy they imply. We specify an efficient algorithm for producing translations using overlap. Finally, our empirical analysis indicates that this approach produces higher quality translations than the standard method of combining non-overlapping fragments generated by our Example-Based MT (EBMT) system in a peak-to-peak comparison.

Co-authors

Marc-Alexandre Côté 5

Jaime G. Carbonell 4

Rebecca Sharp 4

Oyvind Tafjord 4

Ralf D. Brown 3

Prithviraj Ammanabrolu 2

Michael Hammond 2

Bodhisattwa Prasad Majumder 2

Fragkiskos D. Malliaros 2

Steven Marmorstein 2

Jaycie Martin 2

Danilo Neves Ribeiro 2

Alexander Panchenko 2

Swapna Somasundaran 2

Sebastian Thiem 2

Elizabeth Wainwright 2

Niranjan Balasubramanian 1

Chris Callison-Burch 1

Violetta Cavalli-Sforza 1

Laura W. Dozal 1

Benjamin Van Durme 1

Roberto Furfaro 1

Goran Glavaš 1

Gus Hahn-Powell 1

Samiah Hassan 1

Ioana Hulpuș 1

Rebecca Hutchinson 1

Dongwei Jiang 1

Zheng Ping Jiang 1

Alice Saebom Kwak 1

Varvara Logacheva 1

Clayton T Morrison 1

Huitzilin Ortiz 1

Leighanna Pipatanangkura 1

Marissa Radensky 1

Shreya Sharma 1

Pao Siangliulue 1

Kelly J. Smith 1

Harish Tayyar Madabushi 1

Mokanarangan Thayaparan 1

Harsh Trivedi 1

Marco Valentino 1

Marco A. Valenzuela-Escárcega 1

Michalis Vazirgiannis 1

Nathaniel Weir 1

Daniel S. Weld 1

Venues