Siddharth Vashishtha


2024

FAMuS: Frames Across Multiple Sources
Siddharth Vashishtha | Alexander Martin | William Gantt | Benjamin Van Durme | Aaron White
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Understanding event descriptions is a central aspect of language processing, but current approaches focus overwhelmingly on single sentences or documents. Aggregating information about an event across documents can offer a much richer understanding. To this end, we present FAMuS, a new corpus of Wikipedia passages that report on some event, paired with underlying, genre-diverse (non-Wikipedia) source articles for the same event. Events and (cross-sentence) arguments in both report and source are annotated against FrameNet, providing broad coverage of different event types. We present results on two key event understanding tasks enabled by FAMuS: source validation—determining whether a document is a valid source for a target report event—and cross-document argument extraction—full-document argument extraction for a target event from both its report and the correct source article.

2023

On Event Individuation for Document-Level Information Extraction
William Gantt | Reno Kriz | Yunmo Chen | Siddharth Vashishtha | Aaron White
Findings of the Association for Computational Linguistics: EMNLP 2023

As information extraction (IE) systems have grown more adept at processing whole documents, the classic task of *template filling* has seen renewed interest as a benchmark for document-level IE. In this position paper, we call into question the suitability of template filling for this purpose. We argue that the task demands definitive answers to thorny questions of *event individuation* — the problem of distinguishing distinct events — about which even human experts disagree. Through an annotation study and error analysis, we show that this raises concerns about the usefulness of template filling metrics, the quality of datasets for the task, and the ability of models to learn it. Finally, we consider possible solutions.

PRESTO: A Multilingual Dataset for Parsing Realistic Task-Oriented Dialogs
Rahul Goel | Waleed Ammar | Aditya Gupta | Siddharth Vashishtha | Motoki Sano | Faiz Surani | Max Chang | HyunJeong Choe | David Greene | Chuan He | Rattima Nitisaroj | Anna Trukhina | Shachi Paul | Pararth Shah | Rushin Shah | Zhou Yu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Research interest in task-oriented dialogs has increased as systems such as Google Assistant, Alexa and Siri have become ubiquitous in everyday life. However, the impact of academic research in this area has been limited by the lack of datasets that realistically capture the wide array of user pain points. To enable research on some of the more challenging aspects of parsing realistic conversations, we introduce PRESTO, a public dataset of over 550K contextual multilingual conversations between humans and virtual assistants. PRESTO contains a diverse array of challenges that occur in real-world NLU tasks, such as disfluencies, code-switching, and revisions. It is the only large-scale, human-generated conversational parsing dataset that provides structured context, such as a user’s contacts and lists, for each example. Our mT5-based baselines demonstrate that the conversational phenomena present in PRESTO are challenging to model, and that this difficulty is further pronounced in a low-resource setup.

2021

LOME: Large Ontology Multilingual Extraction
Patrick Xia | Guanghui Qin | Siddharth Vashishtha | Yunmo Chen | Tongfei Chen | Chandler May | Craig Harman | Kyle Rawlins | Aaron Steven White | Benjamin Van Durme
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

We present LOME, a system for performing multilingual information extraction. Given a text document as input, our core system identifies spans of textual entity and event mentions with a FrameNet (Baker et al., 1998) parser. It subsequently performs coreference resolution, fine-grained entity typing, and temporal relation prediction between events. By doing so, the system constructs an event- and entity-focused knowledge graph. We can further apply third-party modules for other types of annotation, like relation extraction. Our (multilingual) first-party modules either outperform or are competitive with the (monolingual) state-of-the-art. We achieve this through the use of multilingual encoders like XLM-R (Conneau et al., 2020) and leveraging multilingual training data. LOME is available as a Docker container on Docker Hub. In addition, a lightweight version of the system is accessible as a web demo.

2020

The Universal Decompositional Semantics Dataset and Decomp Toolkit
Aaron Steven White | Elias Stengel-Eskin | Siddharth Vashishtha | Venkata Subrahmanyan Govindarajan | Dee Ann Reisinger | Tim Vieira | Keisuke Sakaguchi | Sheng Zhang | Francis Ferraro | Rachel Rudinger | Kyle Rawlins | Benjamin Van Durme
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present the Universal Decompositional Semantics (UDS) dataset (v1.0), which is bundled with the Decomp toolkit (v0.1). UDS1.0 unifies five high-quality, decompositional semantics-aligned annotation sets within a single semantic graph specification—with graph structures defined by the predicative patterns produced by the PredPatt tool and real-valued node and edge attributes constructed using sophisticated normalization procedures. The Decomp toolkit provides a suite of Python 3 tools for querying UDS graphs using SPARQL. Both UDS1.0 and Decomp0.1 are publicly available at http://decomp.io.

Temporal Reasoning in Natural Language Inference
Siddharth Vashishtha | Adam Poliak | Yash Kumar Lal | Benjamin Van Durme | Aaron Steven White
Findings of the Association for Computational Linguistics: EMNLP 2020

We introduce five new natural language inference (NLI) datasets focused on temporal reasoning. We recast four existing datasets annotated for event duration—how long an event lasts—and event ordering—how events are temporally arranged—into more than one million NLI examples. We use these datasets to investigate how well neural models trained on a popular NLI corpus capture these forms of temporal reasoning.

2019

Fine-Grained Temporal Relation Extraction
Siddharth Vashishtha | Benjamin Van Durme | Aaron Steven White
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We present a novel semantic framework for modeling temporal relations and event durations that maps pairs of events to real-valued scales. We use this framework to construct the largest temporal relations dataset to date, covering the entirety of the Universal Dependencies English Web Treebank. We use this dataset to train models for jointly predicting fine-grained temporal relations and event durations. We report strong results on our data and show the efficacy of a transfer-learning approach for predicting categorical relations.