Sha Li


pdf bib
A Language-First Approach for Procedure Planning
Jiateng Liu | Sha Li | Zhenhailong Wang | Manling Li | Heng Ji
Findings of the Association for Computational Linguistics: ACL 2023

Procedure planning, or the ability to predict a series of steps that can achieve a given goal conditioned on the current observation, is critical for building intelligent embodied agents that can assist users in everyday tasks. Encouraged by the recent success of language models (LMs) for zero-shot and few-shot planning, we hypothesize that LMs may be equipped with stronger priors for planning compared to their visual counterparts. To this end, we propose a language-first procedure planning framework with a modularized design: we first align the current and goal observations with corresponding steps and then use a pre-trained LM to predict the intermediate steps. Under this framework, we find that using an image captioning model for alignment can already match state-of-the-art performance and by designing a double retrieval model conditioned over current and goal observations jointly, we can achieve large improvements (19.2%-98.9% relatively higher success rate than state-of-the-art) on both COIN and CrossTask benchmarks. Our work verifies the planning ability of LMs and demonstrates how LMs can serve as a powerful “reasoning engine” even when the input is provided in another modality.

pdf bib
OpenPI-C: A Better Benchmark and Stronger Baseline for Open-Vocabulary State Tracking
Xueqing Wu | Sha Li | Heng Ji
Findings of the Association for Computational Linguistics: ACL 2023

Open-vocabulary state tracking is a more practical version of state tracking that aims to track state changes of entities throughout a process without restricting the state space and entity space. OpenPI (Tandon et al., 2020) is to date the only dataset annotated for open-vocabulary state tracking. However, we identify issues with the dataset quality and evaluation metric. For the dataset, we categorize 3 types of problems on the procedure level, step level and state change level respectively, and build a clean dataset OpenPI-C using multiple rounds of human judgment. For the evaluation metric, we propose a cluster-based metric to fix the original metric’s preference for repetition. Model-wise, we enhance the seq2seq generation baseline by reinstating two key properties for state tracking: temporal dependency and entity awareness. The state of the world after an action is inherently dependent on the previous state. We model this dependency through a dynamic memory bank and allow the model to attend to the memory slots during decoding. On the other hand, the state of the world is naturally a union of the states of involved entities. Since the entities are unknown in the open-vocabulary setting, we propose a two-stage model that refines the state change prediction conditioned on entities predicted from the first stage. Empirical results show the effectiveness of our proposed model, especially on the cleaned dataset and the cluster-based metric. The code and data are released at

pdf bib
Code4Struct: Code Generation for Few-Shot Event Structure Prediction
Xingyao Wang | Sha Li | Heng Ji
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Model (LLM) trained on a mixture of text and code has demonstrated impressive capability in translating natural language (NL) into structured code. We observe that semantic structures can be conveniently translated into code and propose Code4Struct to leverage such text-to-structure translation capability to tackle structured prediction tasks. As a case study, we formulate Event Argument Extraction (EAE) as converting text into event-argument structures that can be represented as a class object using code. This alignment between structures and code enables us to take advantage of Programming Language (PL) features such as inheritance and type annotation to introduce external knowledge or add constraints. We show that, with sufficient in-context examples, formulating EAE as a code generation problem is advantageous over using variants of text-based prompts. Despite only using 20 training event instances for each event type, Code4Struct is comparable to supervised models trained on 4,202 instances and outperforms current state-of-the-art (SOTA) trained on 20-shot data by 29.5% absolute F1. Code4Struct can use 10-shot training data from a sibling event type to predict arguments for zero-resource event types and outperforms the zero-shot baseline by 12% absolute F1.

pdf bib
Non-Sequential Graph Script Induction via Multimedia Grounding
Yu Zhou | Sha Li | Manling Li | Xudong Lin | Shih-Fu Chang | Mohit Bansal | Heng Ji
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Online resources such as WikiHow compile a wide range of scripts for performing everyday tasks, which can assist models in learning to reason about procedures. However, the scripts are always presented in a linear manner, which does not reflect the flexibility displayed by people executing tasks in real life. For example, in the CrossTask Dataset, 64.5% of consecutive step pairs are also observed in the reverse order, suggesting their ordering is not fixed. In addition, each step has an average of 2.56 frequent next steps, demonstrating “branching”. In this paper, we propose the new challenging task of non-sequential graph script induction, aiming to capture optional and interchangeable steps in procedural planning. To automate the induction of such graph scripts for given tasks, we propose to take advantage of loosely aligned videos of people performing the tasks. In particular, we design a multimodal framework to ground procedural videos to WikiHow textual steps and thus transform each video into an observed step path on the latent ground truth graph script. This key transformation enables us to train a script knowledge model capable of both generating explicit graph scripts for learnt tasks and predicting future steps given a partial step sequence. Our best model outperforms the strongest pure text/vision baselines by 17.52% absolute gains on F1@3 for next step prediction and 13.8% absolute gains on Acc@1 for partial sequence completion. Human evaluation shows our model outperforming the WikiHow linear baseline by 48.76% absolute gains in capturing sequential and non-sequential step relationships.

pdf bib
Open-Domain Hierarchical Event Schema Induction by Incremental Prompting and Verification
Sha Li | Ruining Zhao | Manling Li | Heng Ji | Chris Callison-Burch | Jiawei Han
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Event schemas are a form of world knowledge about the typical progression of events. Recent methods for event schema induction use information extraction systems to construct a large number of event graph instances from documents, and then learn to generalize the schema from such instances. In contrast, we propose to treat event schemas as a form of commonsense knowledge that can be derived from large language models (LLMs). This new paradigm greatly simplifies the schema induction process and allows us to handle both hierarchical relations and temporal relations between events in a straightforward way. Since event schemas have complex graph structures, we design an incremental prompting and verification method IncPrompt to break down the construction of a complex event graph into three stages: event skeleton construction, event expansion, and event-event relation verification. Compared to directly using LLMs to generate a linearized graph, IncSchema can generate large and complex schemas with 7.2% F1 improvement in temporal relations and 31.0% F1 improvement in hierarchical relations. In addition, compared to the previous state-of-the-art closed-domain schema induction model, human assessors were able to cover ~10% more events when translating the schemas into coherent stories and rated our schemas 1.3 points higher (on a 5-point scale) in terms of readability.

pdf bib
Human-in-the-loop Schema Induction
Tianyi Zhang | Isaac Tham | Zhaoyi Hou | Jiaxuan Ren | Leon Zhou | Hainiu Xu | Li Zhang | Lara Martin | Rotem Dror | Sha Li | Heng Ji | Martha Palmer | Susan Windisch Brown | Reece Suchocki | Chris Callison-Burch
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

Schema induction builds a graph representation explaining how events unfold in a scenario. Existing approaches have been based on information retrieval (IR) and information extraction (IE), often with limited human curation. We demonstrate a human-in-the-loop schema induction system powered by GPT-3. We first describe the different modules of our system, including prompting to generate schematic elements, manual edit of those elements, and conversion of those into a schema graph. By qualitatively comparing our system to previous ones, we show that our system not only transfers to new domains more easily than previous approaches, but also reduces efforts of human curation thanks to our interactive interface.


pdf bib
Enhancing Knowledge Selection for Grounded Dialogues via Document Semantic Graphs
Sha Li | Mahdi Namazifar | Di Jin | Mohit Bansal | Heng Ji | Yang Liu | Dilek Hakkani-Tur
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Providing conversation models with background knowledge has been shown to make open-domain dialogues more informative and engaging. Existing models treat knowledge selection as a sentence ranking or classification problem where each sentence is handled individually, ignoring the internal semantic connection between sentences. In this work, we propose to automatically convert the background knowledge documents into document semantic graphs and then perform knowledge selection over such graphs. Our document semantic graphs preserve sentence-level information through the use of sentence nodes and provide concept connections between sentences. We apply multi-task learning to perform sentence-level knowledge selection and concept-level knowledge selection, showing that it improves sentence-level selection. Our experiments show that our semantic graph-based knowledge selection improves over sentence selection baselines for both the knowledge selection task and the end-to-end response generation task on HollE and improves generalization on unseen topics in WoW.

pdf bib
RESIN-11: Schema-guided Event Prediction for 11 Newsworthy Scenarios
Xinya Du | Zixuan Zhang | Sha Li | Pengfei Yu | Hongwei Wang | Tuan Lai | Xudong Lin | Ziqi Wang | Iris Liu | Ben Zhou | Haoyang Wen | Manling Li | Darryl Hannan | Jie Lei | Hyounghun Kim | Rotem Dror | Haoyu Wang | Michael Regan | Qi Zeng | Qing Lyu | Charles Yu | Carl Edwards | Xiaomeng Jin | Yizhu Jiao | Ghazaleh Kazeminejad | Zhenhailong Wang | Chris Callison-Burch | Mohit Bansal | Carl Vondrick | Jiawei Han | Dan Roth | Shih-Fu Chang | Martha Palmer | Heng Ji
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations

We introduce RESIN-11, a new schema-guided event extraction&prediction framework that can be applied to a large variety of newsworthy scenarios. The framework consists of two parts: (1) an open-domain end-to-end multimedia multilingual information extraction system with weak-supervision and zero-shot learningbased techniques. (2) schema matching and schema-guided event prediction based on our curated schema library. We build a demo website based on our dockerized system and schema library publicly available for installation ( We also include a video demonstrating the system.

pdf bib
Dynamic Global Memory for Document-level Argument Extraction
Xinya Du | Sha Li | Heng Ji
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Extracting informative arguments of events from news articles is a challenging problem in information extraction, which requires a global contextual understanding of each document. While recent work on document-level extraction has gone beyond single-sentence and increased the cross-sentence inference capability of end-to-end models, they are still restricted by certain input sequence length constraints and usually ignore the global context between events. To tackle this issue, we introduce a new global neural generation-based framework for document-level event argument extraction by constructing a document memory store to record the contextual event information and leveraging it to implicitly and explicitly help with decoding of arguments for later events. Empirical results show that our framework outperforms prior methods substantially and it is more robust to adversarially annotated examples with our constrained decoding design.

pdf bib
Eider: Empowering Document-level Relation Extraction with Efficient Evidence Extraction and Inference-stage Fusion
Yiqing Xie | Jiaming Shen | Sha Li | Yuning Mao | Jiawei Han
Findings of the Association for Computational Linguistics: ACL 2022

Document-level relation extraction (DocRE) aims to extract semantic relations among entity pairs in a document. Typical DocRE methods blindly take the full document as input, while a subset of the sentences in the document, noted as the evidence, are often sufficient for humans to predict the relation of an entity pair. In this paper, we propose an evidence-enhanced framework, Eider, that empowers DocRE by efficiently extracting evidence and effectively fusing the extracted evidence in inference. We first jointly train an RE model with a lightweight evidence extraction model, which is efficient in both memory and runtime. Empirically, even training the evidence model on silver labels constructed by our heuristic rules can lead to better RE performance. We further design a simple yet effective inference process that makes RE predictions on both extracted evidence and the full document, then fuses the predictions through a blending layer. This allows Eider to focus on important sentences while still having access to the complete information in the document. Extensive experiments show that Eider outperforms state-of-the-art methods on three benchmark datasets (e.g., by 1.37/1.26 Ign F1/F1 on DocRED).

pdf bib
Open-Vocabulary Argument Role Prediction For Event Extraction
Yizhu Jiao | Sha Li | Yiqing Xie | Ming Zhong | Heng Ji | Jiawei Han
Findings of the Association for Computational Linguistics: EMNLP 2022

The argument role in event extraction refers to the relation between an event and an argument participating in it. Despite the great progress in event extraction, existing studies still depend on roles pre-defined by domain experts. These studies expose obvious weakness when extending to emerging event types or new domains without available roles. Therefore, more attention and effort needs to be devoted to automatically customizing argument roles. In this paper, we define this essential but under-explored task: open-vocabulary argument role prediction. The goal of this task is to infer a set of argument roles for a given event type. We propose a novel unsupervised framework, RolePred for this task. Specifically, we formulate the role prediction problem as an in-filling task and construct prompts for a pre-trained language model to generate candidate roles. By extracting and analyzing the candidate arguments, the event-specific roles are further merged and selected. To standardize the research of this task, we collect a new human-annotated event extraction dataset including 143 customized argument roles with rich semantics. On this dataset, RolePred outperforms the existing methods by a large margin.

pdf bib
Open Relation and Event Type Discovery with Type Abstraction
Sha Li | Heng Ji | Jiawei Han
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Conventional “closed-world” information extraction (IE) approaches rely on human ontologies to define the scope for extraction. As a result, such approaches fall short when applied to new domains. This calls for systems that can automatically infer new types from given corpora, a task which we refer to as type discovery. To tackle this problem, we introduce the idea of type abstraction, where the model is prompted to generalize and name the type. Then we use the similarity between inferred names to induce clusters. Observing that this abstraction-based representation is often complementary to the entity/trigger token representation, we set up these two representations as two views and design our model as a co-training framework. Our experiments on multiple relation extraction and event extraction datasets consistently show the advantage of our type abstraction approach.


pdf bib
The Future is not One-dimensional: Complex Event Schema Induction by Graph Modeling for Event Prediction
Manling Li | Sha Li | Zhenhailong Wang | Lifu Huang | Kyunghyun Cho | Heng Ji | Jiawei Han | Clare Voss
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Event schemas encode knowledge of stereotypical structures of events and their connections. As events unfold, schemas are crucial to act as a scaffolding. Previous work on event schema induction focuses either on atomic events or linear temporal event sequences, ignoring the interplay between events via arguments and argument relations. We introduce a new concept of Temporal Complex Event Schema: a graph-based schema representation that encompasses events, arguments, temporal connections and argument relations. In addition, we propose a Temporal Event Graph Model that predicts event instances following the temporal complex event schema. To build and evaluate such schemas, we release a new schema learning corpus containing 6,399 documents accompanied with event graphs, and we have manually constructed gold-standard schemas. Intrinsic evaluations by schema matching and instance graph perplexity, prove the superior quality of our probabilistic graph schema library compared to linear representations. Extrinsic evaluation on schema-guided future event prediction further demonstrates the predictive power of our event graph model, significantly outperforming human schemas and baselines by more than 17.8% on HITS@1.

pdf bib
Document-Level Event Argument Extraction by Conditional Generation
Sha Li | Heng Ji | Jiawei Han
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Event extraction has long been treated as a sentence-level task in the IE community. We argue that this setting does not match human informative seeking behavior and leads to incomplete and uninformative extraction results. We propose a document-level neural event argument extraction model by formulating the task as conditional generation following event templates. We also compile a new document-level event extraction benchmark dataset WikiEvents which includes complete event and coreference annotation. On the task of argument extraction, we achieve an absolute gain of 7.6% F1 and 5.7% F1 over the next best model on the RAMS and WikiEvents dataset respectively. On the more challenging task of informative argument extraction, which requires implicit coreference reasoning, we achieve a 9.3% F1 gain over the best baseline. To demonstrate the portability of our model, we also create the first end-to-end zero-shot event extraction framework and achieve 97% of fully supervised model’s trigger extraction performance and 82% of the argument extraction performance given only access to 10 out of the 33 types on ACE.

pdf bib
RESIN: A Dockerized Schema-Guided Cross-document Cross-lingual Cross-media Information Extraction and Event Tracking System
Haoyang Wen | Ying Lin | Tuan Lai | Xiaoman Pan | Sha Li | Xudong Lin | Ben Zhou | Manling Li | Haoyu Wang | Hongming Zhang | Xiaodong Yu | Alexander Dong | Zhenhailong Wang | Yi Fung | Piyush Mishra | Qing Lyu | Dídac Surís | Brian Chen | Susan Windisch Brown | Martha Palmer | Chris Callison-Burch | Carl Vondrick | Jiawei Han | Dan Roth | Shih-Fu Chang | Heng Ji
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations

We present a new information extraction system that can automatically construct temporal event graphs from a collection of news documents from multiple sources, multiple languages (English and Spanish for our experiment), and multiple data modalities (speech, text, image and video). The system advances state-of-the-art from two aspects: (1) extending from sentence-level event extraction to cross-document cross-lingual cross-media event extraction, coreference resolution and temporal event tracking; (2) using human curated event schema library to match and enhance the extraction output. We have made the dockerlized system publicly available for research purpose at GitHub, with a demo video.