Ben Zhou


pdf bib
End-to-End Chinese Speaker Identification
Dian Yu | Ben Zhou | Dong Yu
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Speaker identification (SI) in texts aims to identify the speaker(s) for each utterance in texts. Previous studies divide SI into several sub-tasks (e.g., quote extraction, named entity recognition, gender identification, and coreference resolution). However, we are still far from solving these sub-tasks, making SI systems that rely on them seriously suffer from error propagation. End-to-end SI systems, on the other hand, are not limited by individual modules, but suffer from insufficient training data from the existing small-scale datasets. To make large end-to-end models possible, we design a new annotation guideline that regards SI as span extraction from the local context, and we annotate by far the largest SI dataset for Chinese named CSI based on eighteen novels. Viewing SI as a span selection task also introduces the possibility of applying existing storng extractive machine reading comprehension (MRC) baselines. Surprisingly, simply using such a baseline without human-annotated character names and carefully designed rules, we can already achieve performance comparable or better than those of previous state-of-the-art SI methods on all public SI datasets for Chinese. Furthermore, we show that our dataset can serve as additional training data for existing benchmarks, which leads to further gains (up to 6.5% in accuracy). Finally, using CSI as a clean source, we design an effective self-training paradigm to continuously leverage hundreds of unlabeled novels.

pdf bib
RESIN-11: Schema-guided Event Prediction for 11 Newsworthy Scenarios
Xinya Du | Zixuan Zhang | Sha Li | Pengfei Yu | Hongwei Wang | Tuan Lai | Xudong Lin | Ziqi Wang | Iris Liu | Ben Zhou | Haoyang Wen | Manling Li | Darryl Hannan | Jie Lei | Hyounghun Kim | Rotem Dror | Haoyu Wang | Michael Regan | Qi Zeng | Qing Lyu | Charles Yu | Carl Edwards | Xiaomeng Jin | Yizhu Jiao | Ghazaleh Kazeminejad | Zhenhailong Wang | Chris Callison-Burch | Mohit Bansal | Carl Vondrick | Jiawei Han | Dan Roth | Shih-Fu Chang | Martha Palmer | Heng Ji
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations

We introduce RESIN-11, a new schema-guided event extraction&prediction framework that can be applied to a large variety of newsworthy scenarios. The framework consists of two parts: (1) an open-domain end-to-end multimedia multilingual information extraction system with weak-supervision and zero-shot learningbased techniques. (2) schema matching and schema-guided event prediction based on our curated schema library. We build a demo website based on our dockerized system and schema library publicly available for installation ( We also include a video demonstrating the system.

pdf bib
New Frontiers of Information Extraction
Muhao Chen | Lifu Huang | Manling Li | Ben Zhou | Heng Ji | Dan Roth
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorial Abstracts

This tutorial targets researchers and practitioners who are interested in AI and ML technologies for structural information extraction (IE) from unstructured textual sources. Particularly, this tutorial will provide audience with a systematic introduction to recent advances of IE, by answering several important research questions. These questions include (i) how to develop an robust IE system from noisy, insufficient training data, while ensuring the reliability of its prediction? (ii) how to foster the generalizability of IE through enhancing the system’s cross-lingual, cross-domain, cross-task and cross-modal transferability? (iii) how to precisely support extracting structural information with extremely fine-grained, diverse and boundless labels? (iv) how to further improve IE by leveraging indirect supervision from other NLP tasks, such as NLI, QA or summarization, and pre-trained language models? (v) how to acquire knowledge to guide the inference of IE systems? We will discuss several lines of frontier research that tackle those challenges, and will conclude the tutorial by outlining directions for further investigation.

pdf bib
There’s a Time and Place for Reasoning Beyond the Image
Xingyu Fu | Ben Zhou | Ishaan Chandratreya | Carl Vondrick | Dan Roth
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Images are often more significant than only the pixels to human eyes, as we can infer, associate, and reason with contextual information from other sources to establish a more complete picture. For example, in Figure 1, we can find a way to identify the news articles related to the picture through segment-wise understandings of the signs, the buildings, the crowds, and more. This reasoning could provide the time and place the image was taken, which will help us in subsequent tasks, such as automatic storyline construction, correction of image source in intended effect photographs, and upper-stream processing such as image clustering for certain location or time.In this work, we formulate this problem and introduce TARA: a dataset with 16k images with their associated news, time, and location, automatically extracted from New York Times, and an additional 61k examples as distant supervision from WIT. On top of the extractions, we present a crowdsourced subset in which we believe it is possible to find the images’ spatio-temporal information for evaluation purpose. We show that there exists a 70% gap between a state-of-the-art joint model and human performance, which is slightly filled by our proposed model that uses segment-wise reasoning, motivating higher-level vision-language joint models that can conduct open-ended reasoning with world knowledge.The data and code are publicly available at

pdf bib
A Meta-framework for Spatiotemporal Quantity Extraction from Text
Qiang Ning | Ben Zhou | Hao Wu | Haoruo Peng | Chuchu Fan | Matt Gardner
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

News events are often associated with quantities (e.g., the number of COVID-19 patients or the number of arrests in a protest), and it is often important to extract their type, time, and location from unstructured text in order to analyze these quantity events. This paper thus formulates the NLP problem of spatiotemporal quantity extraction, and proposes the first meta-framework for solving it. This meta-framework contains a formalism that decomposes the problem into several information extraction tasks, a shareable crowdsourcing pipeline, and transformer-based baseline models. We demonstrate the meta-framework in three domains—the COVID-19 pandemic, Black Lives Matter protests, and 2020 California wildfires—to show that the formalism is general and extensible, the crowdsourcing pipeline facilitates fast and high-quality data annotation, and the baseline system can handle spatiotemporal quantity extraction well enough to be practically useful. We release all resources for future research on this topic at


pdf bib
Temporal Reasoning on Implicit Events from Distant Supervision
Ben Zhou | Kyle Richardson | Qiang Ning | Tushar Khot | Ashish Sabharwal | Dan Roth
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We propose TRACIE, a novel temporal reasoning dataset that evaluates the degree to which systems understand implicit events—events that are not mentioned explicitly in natural language text but can be inferred from it. This introduces a new challenge in temporal reasoning research, where prior work has focused on explicitly mentioned events. Human readers can infer implicit events via commonsense reasoning, resulting in a more comprehensive understanding of the situation and, consequently, better reasoning about time. We find, however, that state-of-the-art models struggle when predicting temporal relationships between implicit and explicit events. To address this, we propose a neuro-symbolic temporal reasoning model, SymTime, which exploits distant supervision signals from large-scale text and uses temporal rules to combine start times and durations to infer end times. SymTime outperforms strong baseline systems on TRACIE by 5%, and by 11% in a zero prior knowledge training setting. Our approach also generalizes to other temporal reasoning tasks, as evidenced by a gain of 1%-9% on MATRES, an explicit event benchmark.

pdf bib
RESIN: A Dockerized Schema-Guided Cross-document Cross-lingual Cross-media Information Extraction and Event Tracking System
Haoyang Wen | Ying Lin | Tuan Lai | Xiaoman Pan | Sha Li | Xudong Lin | Ben Zhou | Manling Li | Haoyu Wang | Hongming Zhang | Xiaodong Yu | Alexander Dong | Zhenhailong Wang | Yi Fung | Piyush Mishra | Qing Lyu | Dídac Surís | Brian Chen | Susan Windisch Brown | Martha Palmer | Chris Callison-Burch | Carl Vondrick | Jiawei Han | Dan Roth | Shih-Fu Chang | Heng Ji
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations

We present a new information extraction system that can automatically construct temporal event graphs from a collection of news documents from multiple sources, multiple languages (English and Spanish for our experiment), and multiple data modalities (speech, text, image and video). The system advances state-of-the-art from two aspects: (1) extending from sentence-level event extraction to cross-document cross-lingual cross-media event extraction, coreference resolution and temporal event tracking; (2) using human curated event schema library to match and enhance the extraction output. We have made the dockerlized system publicly available for research purpose at GitHub, with a demo video.

pdf bib
Cross-lingual Entity Alignment with Incidental Supervision
Muhao Chen | Weijia Shi | Ben Zhou | Dan Roth
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Much research effort has been put to multilingual knowledge graph (KG) embedding methods to address the entity alignment task, which seeks to match entities in different languagespecific KGs that refer to the same real-world object. Such methods are often hindered by the insufficiency of seed alignment provided between KGs. Therefore, we propose a new model, JEANS , which jointly represents multilingual KGs and text corpora in a shared embedding scheme, and seeks to improve entity alignment with incidental supervision signals from text. JEANS first deploys an entity grounding process to combine each KG with the monolingual text corpus. Then, two learning processes are conducted: (i) an embedding learning process to encode the KG and text of each language in one embedding space, and (ii) a self-learning based alignment learning process to iteratively induce the correspondence of entities and that of lexemes between embeddings. Experiments on benchmark datasets show that JEANS leads to promising improvement on entity alignment with incidental supervision, and significantly outperforms state-of-the-art methods that solely rely on internal information of KGs.


pdf bib
Temporal Common Sense Acquisition with Minimal Supervision
Ben Zhou | Qiang Ning | Daniel Khashabi | Dan Roth
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Temporal common sense (e.g., duration and frequency of events) is crucial for understanding natural language. However, its acquisition is challenging, partly because such information is often not expressed explicitly in text, and human annotation on such concepts is costly. This work proposes a novel sequence modeling approach that exploits explicit and implicit mentions of temporal common sense, extracted from a large corpus, to build TacoLM, a temporal common sense language model. Our method is shown to give quality predictions of various dimensions of temporal common sense (on UDST and a newly collected dataset from RealNews). It also produces representations of events for relevant tasks such as duration comparison, parent-child relations, event coreference and temporal QA (on TimeBank, HiEVE and MCTACO) that are better than using the standard BERT. Thus, it will be an important component of temporal NLP.

pdf bib
Evaluating Models’ Local Decision Boundaries via Contrast Sets
Matt Gardner | Yoav Artzi | Victoria Basmov | Jonathan Berant | Ben Bogin | Sihao Chen | Pradeep Dasigi | Dheeru Dua | Yanai Elazar | Ananth Gottumukkala | Nitish Gupta | Hannaneh Hajishirzi | Gabriel Ilharco | Daniel Khashabi | Kevin Lin | Jiangming Liu | Nelson F. Liu | Phoebe Mulcaire | Qiang Ning | Sameer Singh | Noah A. Smith | Sanjay Subramanian | Reut Tsarfaty | Eric Wallace | Ally Zhang | Ben Zhou
Findings of the Association for Computational Linguistics: EMNLP 2020

Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture the abilities a dataset is intended to test. We propose a more rigorous annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model’s decision boundary, which can be used to more accurately evaluate a model’s true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, and IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets—up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.


pdf bib
“Going on a vacation” takes longer than “Going for a walk”: A Study of Temporal Commonsense Understanding
Ben Zhou | Daniel Khashabi | Qiang Ning | Dan Roth
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Understanding time is crucial for understanding events expressed in natural language. Because people rarely say the obvious, it is often necessary to have commonsense knowledge about various temporal aspects of events, such as duration, frequency, and temporal order. However, this important problem has so far received limited attention. This paper systematically studies this temporal commonsense problem. Specifically, we define five classes of temporal commonsense, and use crowdsourcing to develop a new dataset, MCTACO, that serves as a test set for this task. We find that the best current methods used on MCTACO are still far behind human performance, by about 20%, and discuss several directions for improvement. We hope that the new dataset and our study here can foster more future research on this topic.


pdf bib
CogCompNLP: Your Swiss Army Knife for NLP
Daniel Khashabi | Mark Sammons | Ben Zhou | Tom Redman | Christos Christodoulopoulos | Vivek Srikumar | Nicholas Rizzolo | Lev Ratinov | Guanheng Luo | Quang Do | Chen-Tse Tsai | Subhro Roy | Stephen Mayhew | Zhili Feng | John Wieting | Xiaodong Yu | Yangqiu Song | Shashank Gupta | Shyam Upadhyay | Naveen Arivazhagan | Qiang Ning | Shaoshi Ling | Dan Roth
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Zero-Shot Open Entity Typing as Type-Compatible Grounding
Ben Zhou | Daniel Khashabi | Chen-Tse Tsai | Dan Roth
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

The problem of entity-typing has been studied predominantly as a supervised learning problems, mostly with task-specific annotations (for coarse types) and sometimes with distant supervision (for fine types). While such approaches have strong performance within datasets they often lack the flexibility to transfer across text genres and to generalize to new type taxonomies. In this work we propose a zero-shot entity typing approach that requires no annotated data and can flexibly identify newly defined types. Given a type taxonomy, the entries of which we define as Boolean functions of freebase “types,” we ground a given mention to a set of type-compatible Wikipedia entries, and then infer the target mention’s type using an inference algorithm that makes use of the types of these entries. We evaluate our system on a broad range of datasets, including standard fine-grained and coarse-grained entity typing datasets, and on a dataset in the biological domain. Our system is shown to be competitive with state-of-the-art supervised NER systems, and to outperform them on out-of-training datasets. We also show that our system significantly outperforms other zero-shot fine typing systems.

pdf bib
CogCompTime: A Tool for Understanding Time in Natural Language
Qiang Ning | Ben Zhou | Zhili Feng | Haoruo Peng | Dan Roth
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Automatic extraction of temporal information is important for natural language understanding. It involves two basic tasks: (1) Understanding time expressions that are mentioned explicitly in text (e.g., February 27, 1998 or tomorrow), and (2) Understanding temporal information that is conveyed implicitly via relations. This paper introduces CogCompTime, a system that has these two important functionalities. It incorporates the most recent progress, achieves state-of-the-art performance, and is publicly available at