Shih-Fu Chang

Also published as: Shih-fu Chang


pdf bib
Bridging the Gap between Recognition-level Pre-training and Commonsensical Vision-language Tasks
Yue Wan | Yueen Ma | Haoxuan You | Zhecan Wang | Shih-Fu Chang
Proceedings of the First Workshop on Commonsense Representation and Reasoning (CSRR 2022)

Large-scale visual-linguistic pre-training aims to capture the generic representations from multimodal features, which are essential for downstream vision-language tasks. Existing methods mostly focus on learning the semantic connections between visual objects and linguistic content, which tend to be recognitionlevel information and may not be sufficient for commonsensical reasoning tasks like VCR. In this paper, we propose a novel commonsensical vision-language pre-training framework to bridge the gap. We first augment the conventional image-caption pre-training datasets with commonsense inferences from a visuallinguistic GPT-2. To pre-train models on image, caption and commonsense inferences together, we propose two new tasks: masked commonsense modeling (MCM) and commonsense type prediction (CTP). To reduce the shortcut effect between captions and commonsense inferences, we further introduce the domain-wise adaptive masking that dynamically adjusts the masking ratio. Experimental results on downstream tasks, VCR and VQA, show the improvement of our pre-training strategy over previous methods. Human evaluation also validates the relevance, informativeness, and diversity of the generated commonsense inferences. Overall, we demonstrate the potential of incorporating commonsense knowledge into the conventional recognition-level visual-linguistic pre-training.

pdf bib
RESIN-11: Schema-guided Event Prediction for 11 Newsworthy Scenarios
Xinya Du | Zixuan Zhang | Sha Li | Pengfei Yu | Hongwei Wang | Tuan Lai | Xudong Lin | Ziqi Wang | Iris Liu | Ben Zhou | Haoyang Wen | Manling Li | Darryl Hannan | Jie Lei | Hyounghun Kim | Rotem Dror | Haoyu Wang | Michael Regan | Qi Zeng | Qing Lyu | Charles Yu | Carl Edwards | Xiaomeng Jin | Yizhu Jiao | Ghazaleh Kazeminejad | Zhenhailong Wang | Chris Callison-Burch | Mohit Bansal | Carl Vondrick | Jiawei Han | Dan Roth | Shih-Fu Chang | Martha Palmer | Heng Ji
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations

We introduce RESIN-11, a new schema-guided event extraction&prediction framework that can be applied to a large variety of newsworthy scenarios. The framework consists of two parts: (1) an open-domain end-to-end multimedia multilingual information extraction system with weak-supervision and zero-shot learningbased techniques. (2) schema matching and schema-guided event prediction based on our curated schema library. We build a demo website based on our dockerized system and schema library publicly available for installation ( We also include a video demonstrating the system.


pdf bib
Coreference by Appearance: Visually Grounded Event Coreference Resolution
Liming Wang | Shengyu Feng | Xudong Lin | Manling Li | Heng Ji | Shih-Fu Chang
Proceedings of the Fourth Workshop on Computational Models of Reference, Anaphora and Coreference

Event coreference resolution is critical to understand events in the growing number of online news with multiple modalities including text, video, speech, etc. However, the events and entities depicting in different modalities may not be perfectly aligned and can be difficult to annotate, which makes the task especially challenging with little supervision available. To address the above issues, we propose a supervised model based on attention mechanism and an unsupervised model based on statistical machine translation, capable of learning the relative importance of modalities for event coreference resolution. Experiments on a video multimedia event dataset show that our multimodal models outperform text-only systems in event coreference resolution tasks. A careful analysis reveals that the performance gain of the multimodal model especially under unsupervised settings comes from better learning of visually salient events.

pdf bib
Joint Multimedia Event Extraction from Video and Article
Brian Chen | Xudong Lin | Christopher Thomas | Manling Li | Shoya Yoshida | Lovish Chum | Heng Ji | Shih-Fu Chang
Findings of the Association for Computational Linguistics: EMNLP 2021

Visual and textual modalities contribute complementary information about events described in multimedia documents. Videos contain rich dynamics and detailed unfoldings of events, while text describes more high-level and abstract concepts. However, existing event extraction methods either do not handle video or solely target video while ignoring other modalities. In contrast, we propose the first approach to jointly extract events from both video and text articles. We introduce the new task of Video MultiMedia Event Extraction and propose two novel components to build the first system towards this task. First, we propose the first self-supervised cross-modal event coreference model that can determine coreference between video events and text events without any manually annotated pairs. Second, we introduce the first cross-modal transformer architecture, which extracts structured event information from both videos and text documents. We also construct and will publicly release a new benchmark of video-article pairs, consisting of 860 video-article pairs with extensive annotations for evaluating methods on this task. Our experimental results demonstrate the effectiveness of our proposed method on our new benchmark dataset. We achieve 6.0% and 5.8% absolute F-score gain on multimodal event coreference resolution and multimedia event extraction.

pdf bib
InfoSurgeon: Cross-Media Fine-grained Information Consistency Checking for Fake News Detection
Yi Fung | Christopher Thomas | Revanth Gangi Reddy | Sandeep Polisetty | Heng Ji | Shih-Fu Chang | Kathleen McKeown | Mohit Bansal | Avi Sil
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

To defend against machine-generated fake news, an effective mechanism is urgently needed. We contribute a novel benchmark for fake news detection at the knowledge element level, as well as a solution for this task which incorporates cross-media consistency checking to detect the fine-grained knowledge elements making news articles misinformative. Due to training data scarcity, we also formulate a novel data synthesis method by manipulating knowledge elements within the knowledge graph to generate noisy training data with specific, hard to detect, known inconsistencies. Our detection approach outperforms the state-of-the-art (up to 16.8% accuracy gain), and more critically, yields fine-grained explanations.

pdf bib
Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions
Liunian Harold Li | Haoxuan You | Zhecan Wang | Alireza Zareian | Shih-Fu Chang | Kai-Wei Chang
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Pre-trained contextual vision-and-language (V&L) models have achieved impressive performance on various benchmarks. However, existing models require a large amount of parallel image-caption data for pre-training. Such data are costly to collect and require cumbersome curation. Inspired by unsupervised machine translation, we investigate if a strong V&L representation model can be learned through unsupervised pre-training without image-caption corpora. In particular, we propose to conduct “mask-and-predict” pre-training on text-only and image-only corpora and introduce the object tags detected by an object recognition model as anchor points to bridge two modalities. We find that such a simple approach achieves performance close to a model pre-trained with aligned data, on four English V&L benchmarks. Our work challenges the widely held notion that aligned data is necessary for V&L pre-training, while significantly reducing the amount of supervision needed for V&L models.

pdf bib
COVID-19 Literature Knowledge Graph Construction and Drug Repurposing Report Generation
Qingyun Wang | Manling Li | Xuan Wang | Nikolaus Parulian | Guangxing Han | Jiawei Ma | Jingxuan Tu | Ying Lin | Ranran Haoran Zhang | Weili Liu | Aabhas Chauhan | Yingjun Guan | Bangzheng Li | Ruisong Li | Xiangchen Song | Yi Fung | Heng Ji | Jiawei Han | Shih-Fu Chang | James Pustejovsky | Jasmine Rah | David Liem | Ahmed ELsayed | Martha Palmer | Clare Voss | Cynthia Schneider | Boyan Onyshkevych
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations

To combat COVID-19, both clinicians and scientists need to digest the vast amount of relevant biomedical knowledge in literature to understand the disease mechanism and the related biological functions. We have developed a novel and comprehensive knowledge discovery framework, COVID-KG to extract fine-grained multimedia knowledge elements (entities, relations and events) from scientific literature. We then exploit the constructed multimedia knowledge graphs (KGs) for question answering and report generation, using drug repurposing as a case study. Our framework also provides detailed contextual sentences, subfigures, and knowledge subgraphs as evidence. All of the data, KGs, reports.

pdf bib
RESIN: A Dockerized Schema-Guided Cross-document Cross-lingual Cross-media Information Extraction and Event Tracking System
Haoyang Wen | Ying Lin | Tuan Lai | Xiaoman Pan | Sha Li | Xudong Lin | Ben Zhou | Manling Li | Haoyu Wang | Hongming Zhang | Xiaodong Yu | Alexander Dong | Zhenhailong Wang | Yi Fung | Piyush Mishra | Qing Lyu | Dídac Surís | Brian Chen | Susan Windisch Brown | Martha Palmer | Chris Callison-Burch | Carl Vondrick | Jiawei Han | Dan Roth | Shih-Fu Chang | Heng Ji
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations

We present a new information extraction system that can automatically construct temporal event graphs from a collection of news documents from multiple sources, multiple languages (English and Spanish for our experiment), and multiple data modalities (speech, text, image and video). The system advances state-of-the-art from two aspects: (1) extending from sentence-level event extraction to cross-document cross-lingual cross-media event extraction, coreference resolution and temporal event tracking; (2) using human curated event schema library to match and enhance the extraction output. We have made the dockerlized system publicly available for research purpose at GitHub, with a demo video.


pdf bib
Cross-media Structured Common Space for Multimedia Event Extraction
Manling Li | Alireza Zareian | Qi Zeng | Spencer Whitehead | Di Lu | Heng Ji | Shih-Fu Chang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We introduce a new task, MultiMedia Event Extraction, which aims to extract events and their arguments from multimedia documents. We develop the first benchmark and collect a dataset of 245 multimedia news articles with extensively annotated events and arguments. We propose a novel method, Weakly Aligned Structured Embedding (WASE), that encodes structured representations of semantic information from textual and visual data into a common embedding space. The structures are aligned across modalities by employing a weakly supervised training strategy, which enables exploiting available resources without explicit cross-media annotation. Compared to uni-modal state-of-the-art methods, our approach achieves 4.0% and 9.8% absolute F-score gains on text event argument role labeling and visual event extraction. Compared to state-of-the-art multimedia unstructured representations, we achieve 8.3% and 5.0% absolute F-score gains on multimedia event extraction and argument role labeling, respectively. By utilizing images, we extract 21.4% more event mentions than traditional text-only methods.

pdf bib
GAIA: A Fine-grained Multimedia Knowledge Extraction System
Manling Li | Alireza Zareian | Ying Lin | Xiaoman Pan | Spencer Whitehead | Brian Chen | Bo Wu | Heng Ji | Shih-Fu Chang | Clare Voss | Daniel Napierski | Marjorie Freedman
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

We present the first comprehensive, open source multimedia knowledge extraction system that takes a massive stream of unstructured, heterogeneous multimedia data from various sources and languages as input, and creates a coherent, structured knowledge base, indexing entities, relations, and events, following a rich, fine-grained ontology. Our system, GAIA, enables seamless search of complex graph queries, and retrieves multimedia evidence including text, images and videos. GAIA achieves top performance at the recent NIST TAC SM-KBP2019 evaluation. The system is publicly available at GitHub and DockerHub, with a narrated video that documents the system.

pdf bib
Cross-lingual Structure Transfer for Zero-resource Event Extraction
Di Lu | Ananya Subburathinam | Heng Ji | Jonathan May | Shih-Fu Chang | Avi Sil | Clare Voss
Proceedings of the Twelfth Language Resources and Evaluation Conference

Most of the current cross-lingual transfer learning methods for Information Extraction (IE) have been only applied to name tagging. To tackle more complex tasks such as event extraction we need to transfer graph structures (event trigger linked to multiple arguments with various roles) across languages. We develop a novel share-and-transfer framework to reach this goal with three steps: (1) Convert each sentence in any language to language-universal graph structures; in this paper we explore two approaches based on universal dependency parses and complete graphs, respectively. (2) Represent each node in the graph structure with a cross-lingual word embedding so that all sentences in multiple languages can be represented with one shared semantic space. (3) Using this common semantic space, train event extractors from English training data and apply them to languages that do not have any event annotations. Experimental results on three languages (Spanish, Russian and Ukrainian) without any annotations show this framework achieves comparable performance to a state-of-the-art supervised model trained from more than 1,500 manually annotated event mentions.


pdf bib
Cross-lingual Structure Transfer for Relation and Event Extraction
Ananya Subburathinam | Di Lu | Heng Ji | Jonathan May | Shih-Fu Chang | Avirup Sil | Clare Voss
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

The identification of complex semantic structures such as events and entity relations, already a challenging Information Extraction task, is doubly difficult from sources written in under-resourced and under-annotated languages. We investigate the suitability of cross-lingual structure transfer techniques for these tasks. We exploit relation- and event-relevant language-universal features, leveraging both symbolic (including part-of-speech and dependency path) and distributional (including type representation and contextualized representation) information. By representing all entity mentions, event triggers, and contexts into this complex and structured multilingual common space, using graph convolutional networks, we can train a relation or event extractor from source language annotations and apply it to the target language. Extensive experiments on cross-lingual relation and event transfer among English, Chinese, and Arabic demonstrate that our approach achieves performance comparable to state-of-the-art supervised models trained on up to 3,000 manually annotated mentions: up to 62.6% F-score for Relation Extraction, and 63.1% F-score for Event Argument Role Labeling. The event argument role labeling model transferred from English to Chinese achieves similar performance as the model trained from Chinese. We thus find that language-universal symbolic and distributional representations are complementary for cross-lingual structure transfer.


pdf bib
Incorporating Background Knowledge into Video Description Generation
Spencer Whitehead | Heng Ji | Mohit Bansal | Shih-Fu Chang | Clare Voss
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Most previous efforts toward video captioning focus on generating generic descriptions, such as, “A man is talking.” We collect a news video dataset to generate enriched descriptions that include important background knowledge, such as named entities and related events, which allows the user to fully understand the video content. We develop an approach that uses video meta-data to retrieve topically related news documents for a video and extracts the events and named entities from these documents. Then, given the video as well as the extracted events and entities, we generate a description using a Knowledge-aware Video Description network. The model learns to incorporate entities found in the topically related documents into the description via an entity pointer network and the generation procedure is guided by the event and entity types from the topically related documents through a knowledge gate, which is a gating mechanism added to the model’s decoder that takes a one-hot vector of these types. We evaluate our approach on the new dataset of news videos we have collected, establishing the first benchmark for this dataset as well as proposing a new metric to evaluate these descriptions.

pdf bib
Entity-aware Image Caption Generation
Di Lu | Spencer Whitehead | Lifu Huang | Heng Ji | Shih-Fu Chang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Current image captioning approaches generate descriptions which lack specific information, such as named entities that are involved in the images. In this paper we propose a new task which aims to generate informative image captions, given images and hashtags as input. We propose a simple but effective approach to tackle this problem. We first train a convolutional neural networks - long short term memory networks (CNN-LSTM) model to generate a template caption based on the input image. Then we use a knowledge graph based collective inference algorithm to fill in the template with specific named entities retrieved via the hashtags. Experiments on a new benchmark dataset collected from Flickr show that our model generates news-style image descriptions with much richer information. Our model outperforms unimodal baselines significantly with various evaluation metrics.


pdf bib
A Multi-media Approach to Cross-lingual Entity Knowledge Transfer
Di Lu | Xiaoman Pan | Nima Pourdamghani | Shih-Fu Chang | Heng Ji | Kevin Knight
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Cross-media Event Extraction and Recommendation
Di Lu | Clare Voss | Fangbo Tao | Xiang Ren | Rachel Guan | Rostyslav Korolov | Tongtao Zhang | Dongang Wang | Hongzhi Li | Taylor Cassidy | Heng Ji | Shih-fu Chang | Jiawei Han | William Wallace | James Hendler | Mei Si | Lance Kaplan
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations


pdf bib
Cross-document Event Coreference Resolution based on Cross-media Features
Tongtao Zhang | Hongzhi Li | Heng Ji | Shih-Fu Chang
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing


pdf bib
Exploring Text and Image Features to Classify Images in Bioscience Literature
Barry Rafkind | Minsuk Lee | Shih-Fu Chang | Hong Yu
Proceedings of the HLT-NAACL BioNLP Workshop on Linking Natural Language and Biology