Nathanael Chambers

Also published as: Nathan Chambers

2025

pdf bib abs
Causal Graph based Event Reasoning using Semantic Relation Experts
Mahnaz Koupaee | Xueying Bai | Mudan Chen | Greg Durrett | Nathanael Chambers | Niranjan Balasubramanian
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Understanding how events in a scenario causally connect with each other is important for effectively modeling and reasoning about events. But event reasoning remains a difficult challenge, and despite recent advances, Large Language Models (LLMs) still struggle to accurately identify causal connections between events. This struggle leads to poor performance on deeper reasoning tasks like event forecasting and timeline understanding. To address this challenge, we investigate the generation of causal event graphs (e.g., A enables B) as a parallel mechanism to help LLMs explicitly represent causality during inference. This paper evaluates both how to generate correct graphs as well as how graphs can assist reasoning. We propose a collaborative approach to causal graph generation where we use LLMs to simulate experts that focus on specific semantic relations. The experts engage in multiple rounds of discussions which are then consolidated by a final expert. Then, to demonstrate the utility of causal graphs, we use them on multiple downstream applications, and also introduce a new explainable event prediction task that requires a causal chain of events in the explanation. These explanations are more informative and coherent than baseline generations. Finally, our overall approach not finetuned on any downstream task, achieves competitive results with state-of-the-art models on both forecasting and next event prediction tasks.

2024

pdf bib abs
CaT-Bench: Benchmarking Language Model Understanding of Causal and Temporal Dependencies in Plans
Yash Kumar Lal | Vanya Cohen | Nathanael Chambers | Niranjan Balasubramanian | Ray Mooney
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Understanding the abilities of LLMs to reason about natural language plans, such as instructional text and recipes, is critical to reliably using them in decision-making systems. A fundamental aspect of plans is the temporal order in which their steps need to be executed, which reflects the underlying causal dependencies between them. We introduce CaT-Bench, a benchmark of Step Order Prediction questions, which test whether a step must necessarily occur before or after another in cooking recipe plans. We use this to evaluate how well frontier LLMs understand causal and temporal dependencies. We find that SOTA LLMs are underwhelming (best zero-shot is only 0.59 in F1), and are biased towards predicting dependence more often, perhaps relying on temporal order of steps as a heuristic. While prompting for explanations and using few-shot examples improve performance, the best F1 result is only 0.73. Further, human evaluation of explanations along with answer correctness show that, on average, humans do not agree with model reasoning. Surprisingly, we also find that explaining after answering leads to better performance than normal chain-of-thought prompting, and LLM answers are not consistent across questions about the same step pairs. Overall, results show that LLMs’ ability to detect dependence between steps has significant room for improvement.

2023

pdf bib abs
Modeling Complex Event Scenarios via Simple Entity-focused Questions
Mahnaz Koupaee | Greg Durrett | Nathanael Chambers | Niranjan Balasubramanian
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Event scenarios are often complex and involve multiple event sequences connected through different entity participants. Exploring such complex scenarios requires an ability to branch through different sequences, something that is difficult to achieve with standard event language modeling. To address this, we propose a question-guided generation framework that models events in complex scenarios as answers to questions about participants. At any step in the generation process, the framework uses the previously-generated events as context, but generates the next event as an answer to one of three questions: what else a participant did, what else happened to a participant, or what else happened. The participants and the questions themselves can be sampled or be provided as input from a user, allowing for controllable exploration. Our empirical evaluation shows that this question-guided generation provides better coverage of participants, diverse events within a domain, comparable perplexities for modeling event sequences, and more effective control for interactive schema generation.

Schema induction involves creating a graph representation depicting how events unfold in a scenario. We present SAGEViz, an intuitive and modular tool that utilizes human-AI collaboration to create and update complex schema graphs efficiently, where multiple annotators (humans and models) can work simultaneously on a schema graph from any domain. The tool consists of two components: (1) a curation component powered by plug-and-play event language models to create and expand event sequences while human annotators validate and enrich the sequences to build complex hierarchical schemas, and (2) an easy-to-use visualization component to visualize schemas at varying levels of hierarchy. Using supervised and few-shot approaches, our event language models can continually predict relevant events starting from a seed event. We conduct a user study and show that users need less effort in terms of interaction steps with SAGEViz to generate schemas of better quality. We also include a video demonstrating the system.

pdf bib abs
PASTA: A Dataset for Modeling PArticipant STAtes in Narratives
Sayontan Ghosh | Mahnaz Koupaee | Isabella Chen | Francis Ferraro | Nathanael Chambers | Niranjan Balasubramanian
Transactions of the Association for Computational Linguistics, Volume 11

The events in a narrative are understood as a coherent whole via the underlying states of their participants. Often, these participant states are not explicitly mentioned, instead left to be inferred by the reader. A model that understands narratives should likewise infer these implicit states, and even reason about the impact of changes to these states on the narrative. To facilitate this goal, we introduce a new crowdsourced English-language, Participant States dataset, PASTA. This dataset contains inferable participant states; a counterfactual perturbation to each state; and the changes to the story that would be necessary if the counterfactual were true. We introduce three state-based reasoning tasks that test for the ability to infer when a state is entailed by a story, to revise a story conditioned on a counterfactual state, and to explain the most likely state change given a revised story. Experiments show that today’s LLMs can reason about states to some degree, but there is large room for improvement, especially in problems requiring access and ability to reason with diverse types of knowledge (e.g., physical, numerical, factual).1

2022

Answering questions in narratives about why events happened often requires commonsense knowledge external to the text. What aspects of this knowledge are available in large language models? What aspects can be made accessible via external commonsense resources? We study these questions in the context of answering questions in the TellMeWhy dataset using COMET as a source of relevant commonsense relations. We analyze the effects of model size (T5 and GPT3) along with methods of injecting knowledge (COMET) into these models. Results show that the largest models, as expected, yield substantial improvements over base models. Injecting external knowledge helps models of various sizes, but the amount of improvement decreases with larger model size. We also find that the format in which knowledge is provided is critical, and that smaller models benefit more from larger amounts of knowledge. Finally, we develop an ontology of knowledge types and analyze the relative coverage of the models across these categories.

2021

pdf bib abs
Conditional Generation of Temporally-ordered Event Sequences
Shih-Ting Lin | Nathanael Chambers | Greg Durrett
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Models of narrative schema knowledge have proven useful for a range of event-related tasks, but they typically do not capture the temporal relationships between events. We propose a single model that addresses both temporal ordering, sorting given events into the order they occurred, and event infilling, predicting new events which fit into an existing temporally-ordered sequence. We use a BART-based conditional generation model that can capture both temporality and common event co-occurrence, meaning it can be flexibly applied to different tasks in this space. Our model is trained as a denoising autoencoder: we take temporally-ordered event sequences, shuffle them, delete some events, and then attempt to recover the original event sequence. This task teaches the model to make inferences given incomplete knowledge about the events in an underlying scenario. On the temporal ordering task, we show that our model is able to unscramble event sequences from existing datasets without access to explicitly labeled temporal training data, outperforming both a BERT-based pairwise model and a BERT-based pointer network. On event infilling, human evaluation shows that our model is able to generate events that fit better temporally into the input events when compared to GPT-2 story completion models.

pdf bib abs
Don’t Let Discourse Confine Your Model: Sequence Perturbations for Improved Event Language Models
Mahnaz Koupaee | Greg Durrett | Nathanael Chambers | Niranjan Balasubramanian
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Event language models represent plausible sequences of events. Most existing approaches train autoregressive models on text, which successfully capture event co-occurrence but unfortunately constrain the model to follow the discourse order in which events are presented. Other domains may employ different discourse orders, and for many applications, we may care about different notions of ordering (e.g., temporal) or not care about ordering at all (e.g., when predicting related events in a schema). We propose a simple yet surprisingly effective strategy for improving event language models by perturbing event sequences so we can relax model dependence on text order. Despite generating completely synthetic event orderings, we show that this technique improves the performance of the event language models on both applications and out-of-domain events data.

pdf bib
TellMeWhy: A Dataset for Answering Why-Questions in Narratives
Yash Kumar Lal | Nathanael Chambers | Raymond Mooney | Niranjan Balasubramanian
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib abs
Toward Diverse Precondition Generation
Heeyoung Kwon | Nathanael Chambers | Niranjan Balasubramanian
Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics

A typical goal for language understanding is to logically connect the events of a discourse, but often connective events are not described due to their commonsense nature. In order to address this deficit, we focus here on generating precondition events. Precondition generation can be framed as a sequence-to-sequence problem: given a target event, generate a possible precondition. However, in most real-world scenarios, an event can have several preconditions, which is not always suitable for standard seq2seq frameworks. We propose DiP, the Diverse Precondition generation system that can generate unique and diverse preconditions. DiP consists of three stages of the generative process – an event sampler, a candidate generator, and a post-processor. The event sampler provides control codes (precondition triggers) which the candidate generator uses to focus its generation. Post-processing further improves the results through re-ranking and filtering. Unlike other conditional generation systems, DiP automatically generates control codes without training on diverse examples. Analysis reveals that DiP improves the diversity of preconditions significantly compared to a beam search baseline. Also, manual evaluation shows that DiP generates more preconditions than a strong nucleus sampling baseline.

2020

pdf bib abs
Modeling Label Semantics for Predicting Emotional Reactions
Radhika Gaonkar | Heeyoung Kwon | Mohaddeseh Bastan | Niranjan Balasubramanian | Nathanael Chambers
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Predicting how events induce emotions in the characters of a story is typically seen as a standard multi-label classification task, which usually treats labels as anonymous classes to predict. They ignore information that may be conveyed by the emotion labels themselves. We propose that the semantics of emotion labels can guide a model’s attention when representing the input story. Further, we observe that the emotions evoked by an event are often related: an event that evokes joy is unlikely to also evoke sadness. In this work, we explicitly model label classes via label embeddings, and add mechanisms that track label-label correlations both during training and inference. We also introduce a new semi-supervision strategy that regularizes for the correlations on unlabeled data. Our empirical evaluations show that modeling label semantics yields consistent benefits, and we advance the state-of-the-art on an emotion inference task.

pdf bib abs
Generating Narrative Text in a Switching Dynamical System
Noah Weber | Leena Shekhar | Heeyoung Kwon | Niranjan Balasubramanian | Nathanael Chambers
Proceedings of the 24th Conference on Computational Natural Language Learning

Early work on narrative modeling used explicit plans and goals to generate stories, but the language generation itself was restricted and inflexible. Modern methods use language models for more robust generation, but often lack an explicit representation of the scaffolding and dynamics that guide a coherent narrative. This paper introduces a new model that integrates explicit narrative structure with neural language models, formalizing narrative modeling as a Switching Linear Dynamical System (SLDS). A SLDS is a dynamical system in which the latent dynamics of the system (i.e. how the state vector transforms over time) is controlled by top-level discrete switching variables. The switching variables represent narrative structure (e.g., sentiment or discourse states), while the latent state vector encodes information on the current state of the narrative. This probabilistic formulation allows us to control generation, and can be learned in a semi-supervised fashion using both labeled and unlabeled data. Additionally, we derive a Gibbs sampler for our model that can “fill in” arbitrary parts of the narrative, guided by the switching variables. Our filled-in (English language) narratives outperform several baselines on both automatic and human evaluations

Event schemas can guide our understanding and ability to make predictions with respect to what might happen next. We propose a new Event Graph Schema, where two event types are connected through multiple paths involving entities that fill important roles in a coherent story. We then introduce Path Language Model, an auto-regressive language model trained on event-event paths, and select salient and coherent paths to probabilistically construct these graph schemas. We design two evaluation metrics, instance coverage and instance coherence, to evaluate the quality of graph schema induction, by checking when coherent event instances are covered by the schema graph. Intrinsic evaluations show that our approach is highly effective at inducing salient and coherent schemas. Extrinsic evaluations show the induced schema repository provides significant improvement to downstream end-to-end Information Extraction over a state-of-the-art joint neural extraction model, when used as additional global features to unfold instance graphs.

Preconditions provide a form of logical connection between events that explains why some events occur together and information that is complementary to the more widely studied relations such as causation, temporal ordering, entailment, and discourse relations. Modeling preconditions in text has been hampered in part due to the lack of large scale labeled data grounded in text. This paper introduces PeKo, a crowd-sourced annotation of preconditions between event pairs in newswire, an order of magnitude larger than prior text annotations. To complement this new corpus, we also introduce two challenge tasks aimed at modeling preconditions: (i) Precondition Identification – a standard classification task defined over pairs of event mentions, and (ii) Precondition Generation – a generative task aimed at testing a more general ability to reason about a given event. Evaluation on both tasks shows that modeling preconditions is challenging even for today’s large language models (LM). This suggests that precondition knowledge is not easily accessible in LM-derived representations alone. Our generation results show that fine-tuning an LM on PeKo yields better conditional relations than when trained on raw text or temporally-ordered corpora.

2019

pdf bib abs
Character-Based Models for Adversarial Phone Extraction: Preventing Human Sex Trafficking
Nathanael Chambers | Timothy Forman | Catherine Griswold | Kevin Lu | Yogaish Khastgir | Stephen Steckler
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

Illicit activity on the Web often uses noisy text to obscure information between client and seller, such as the seller’s phone number. This presents an interesting challenge to language understanding systems; how do we model adversarial noise in a text extraction system? This paper addresses the sex trafficking domain, and proposes some of the first neural network architectures to learn and extract phone numbers from noisy text. We create a new adversarial advertisement dataset, propose several RNN-based models to solve the problem, and most notably propose a visual character language model to interpret unseen unicode characters. We train a CRF jointly with a CNN to improve number recognition by 89% over just a CRF. Through data augmentation in this unique model, we present the first results on characters never seen in training.

2018

pdf bib abs
Hierarchical Quantized Representations for Script Generation
Noah Weber | Leena Shekhar | Niranjan Balasubramanian | Nathanael Chambers
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Scripts define knowledge about how everyday scenarios (such as going to a restaurant) are expected to unfold. One of the challenges to learning scripts is the hierarchical nature of the knowledge. For example, a suspect arrested might plead innocent or guilty, and a very different track of events is then expected to happen. To capture this type of information, we propose an autoencoder model with a latent space defined by a hierarchy of categorical variables. We utilize a recently proposed vector quantization based approach, which allows continuous embeddings to be associated with each latent variable value. This permits the decoder to softly decide what portions of the latent hierarchy to condition on by attending over the value embeddings for a given setting. Our model effectively encodes and generates scripts, outperforming a recent language modeling-based method on several standard tasks, and allowing the autoencoder model to achieve substantially lower perplexity scores compared to the previous language modeling-based method.

pdf bib abs
Detecting Denial-of-Service Attacks from Social Media Text: Applying NLP to Computer Security
Nathanael Chambers | Ben Fry | James McMasters
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

This paper describes a novel application of NLP models to detect denial of service attacks using only social media as evidence. Individual networks are often slow in reporting attacks, so a detection system from public data could better assist a response to a broad attack across multiple services. We explore NLP methods to use social media as an indirect measure of network service status. We describe two learning frameworks for this task: a feed-forward neural network and a partially labeled LDA model. Both models outperform previous work by significant margins (20% F1 score). We further show that the topic-based model enables the first fine-grained analysis of how the public reacts to ongoing network attacks, discovering multiple “stages” of observation. This is the first model that both detects network attacks (with best performance) and provides an analysis of when and how the public interprets service outages. We describe the models, present experiments on the largest twitter DDoS corpus to date, and conclude with an analysis of public reactions based on the learned model’s output.

This paper presents a new method for learning typed entailment graphs from text. We extract predicate-argument structures from multiple-source news corpora, and compute local distributional similarity scores to learn entailments between predicates with typed arguments (e.g., person contracted disease). Previous work has used transitivity constraints to improve local decisions, but these constraints are intractable on large graphs. We instead propose a scalable method that learns globally consistent similarity scores based on new soft constraints that consider both the structures across typed entailment graphs and inside each graph. Learning takes only a few hours to run over 100K predicates and our results show large improvements over local similarity scores on two entailment data sets. We further show improvements over paraphrases and entailments from the Paraphrase Database, and prior state-of-the-art entailment graphs. We show that the entailment graphs improve performance in a downstream task.

2017

pdf bib abs
Event Ordering with a Generalized Model for Sieve Prediction Ranking
Bill McDowell | Nathanael Chambers | Alexander Ororbia II | David Reitter
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

This paper improves on several aspects of a sieve-based event ordering architecture, CAEVO (Chambers et al., 2014), which creates globally consistent temporal relations between events and time expressions. First, we examine the usage of word embeddings and semantic role features. With the incorporation of these new features, we demonstrate a 5% relative F1 gain over our replicated version of CAEVO. Second, we reformulate the architecture’s sieve-based inference algorithm as a prediction reranking method that approximately optimizes a scoring function computed using classifier precisions. Within this prediction reranking framework, we propose an alternative scoring function, showing an 8.8% relative gain over the original CAEVO. We further include an in-depth analysis of one of the main datasets that is used to evaluate temporal classifiers, and we show how despite using the densest corpus, there is still a danger of overfitting. While this paper focuses on temporal ordering, its results are applicable to other areas that use sieve-based architectures.

pdf bib
Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics
Michael Roth | Nasrin Mostafazadeh | Nathanael Chambers | Annie Louis
Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics

pdf bib abs
Behind the Scenes of an Evolving Event Cloze Test
Nathanael Chambers
Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics

This paper analyzes the narrative event cloze test and its recent evolution. The test removes one event from a document’s chain of events, and systems predict the missing event. Originally proposed to evaluate learned knowledge of event scenarios (e.g., scripts and frames), most recent work now builds ngram-like language models (LM) to beat the test. This paper argues that the test has slowly/unknowingly been altered to accommodate LMs.5 Most notably, tests are auto-generated rather than by hand, and no effort is taken to include core script events. Recent work is not clear on evaluation goals and contains contradictory results. We implement several models, and show that the test’s bias to high-frequency events explains the inconsistencies. We conclude with recommendations on how to return to the test’s original intent, and offer brief suggestions on a path forward.

pdf bib abs
LSDSem 2017 Shared Task: The Story Cloze Test
Nasrin Mostafazadeh | Michael Roth | Annie Louis | Nathanael Chambers | James Allen
Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics

The LSDSem’17 shared task is the Story Cloze Test, a new evaluation for story understanding and script learning. This test provides a system with a four-sentence story and two possible endings, and the system must choose the correct ending to the story. Successful narrative understanding (getting closer to human performance of 100%) requires systems to link various levels of semantics to commonsense knowledge. A total of eight systems participated in the shared task, with a variety of approaches including.

pdf bib abs
Aligning Entity Names with Online Aliases on Twitter
Kevin McKelvey | Peter Goutzounis | Stephen da Cruz | Nathanael Chambers
Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media

This paper presents new models that automatically align online aliases with their real entity names. Many research applications rely on identifying entity names in text, but people often refer to entities with unexpected nicknames and aliases. For example, The King and King James are aliases for Lebron James, a professional basketball player. Recent work on entity linking attempts to resolve mentions to knowledge base entries, like a wikipedia page, but linking is unfortunately limited to well-known entities with pre-built pages. This paper asks a more basic question: can aliases be aligned without background knowledge of the entity? Further, can the semantics surrounding alias mentions be used to inform alignments? We describe statistical models that make decisions based on the lexicographic properties of the aliases with their semantic context in a large corpus of tweets. We experiment on a database of Twitter users and their usernames, and present the first human evaluation for this task. Alignment accuracy approaches human performance at 81%, and we show that while lexicographic features are most important, the semantic context of an alias further improves classification accuracy.

The past 10 years of event ordering research has focused on learning partial orderings over document events and time expressions. The most popular corpus, the TimeBank, contains a small subset of the possible ordering graph. Many evaluations follow suit by only testing certain pairs of events (e.g., only main verbs of neighboring sentences). This has led most research to focus on specific learners for partial labelings. This paper attempts to nudge the discussion from identifying some relations to all relations. We present new experiments on strongly connected event graphs that contain ∼10 times more relations per document than the TimeBank. We also describe a shift away from the single learner to a sieve-based architecture that naturally blends multiple learners into a precision-ranked cascade of sieves. Each sieve adds labels to the event graph one at a time, and earlier sieves inform later ones through transitive closure. This paper thus describes innovations in both approach and task. We experiment on the densest event graphs to date and show a 14% gain over state-of-the-art.

2013

pdf bib
Event Schema Induction with a Probabilistic Entity-Driven Model
Nathanael Chambers
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
NavyTime: Event and Time Ordering from Raw Text
Nathanael Chambers
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

pdf bib
USNA: A Dual-Classifier Approach to Contextual Sentiment Analysis
Ganesh Harihara | Eugene Yang | Nathanael Chambers
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

2012

pdf bib
Learning for Microblogs with Distant Supervision: Political Forecasting with Twitter
Micol Marchetti-Bowick | Nathanael Chambers
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Labeling Documents with Timestamps: Learning from their Time Expressions
Nathanael Chambers
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2011

pdf bib
Template-Based Information Extraction without the Templates
Nathanael Chambers | Dan Jurafsky
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Stanford’s Multi-Pass Sieve Coreference Resolution System at the CoNLL-2011 Shared Task
Heeyoung Lee | Yves Peirsman | Angel Chang | Nathanael Chambers | Mihai Surdeanu | Dan Jurafsky
Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task

2010

pdf bib abs
A Database of Narrative Schemas
Nathanael Chambers | Dan Jurafsky
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes a new language resource of events and semantic roles that characterize real-world situations. Narrative schemas contain sets of related events (edit and publish), a temporal ordering of the events (edit before publish), and the semantic roles of the participants (authors publish books). This type of world knowledge was central to early research in natural language understanding, scripts being one of the main formalisms, they represented common sequences of events that occur in the world. Unfortunately, most of this knowledge was hand-coded and time consuming to create. Current machine learning techniques, as well as a new approach to learning through coreference chains, has allowed us to automatically extract rich event structure from open domain text in the form of narrative schemas. The narrative schema resource described in this paper contains approximately 5000 unique events combined into schemas of varying sizes. We describe the resource, how it is learned, and a new evaluation of the coverage of these schemas over unseen documents.

pdf bib
Improving the Use of Pseudo-Words for Evaluating Selectional Preferences
Nathanael Chambers | Daniel Jurafsky
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics