Christopher Thomas


Enhanced Chart Understanding via Visual Language Pre-training on Plot Table Pairs
Mingyang Zhou | Yi Fung | Long Chen | Christopher Thomas | Heng Ji | Shih-Fu Chang
Findings of the Association for Computational Linguistics: ACL 2023

Building cross-modal intelligence that can understand charts and communicate the salient information hidden behind them is an appealing challenge in the vision and language (V+L) community. The capability to uncover the underlying table data of chart figures is key to automatic chart understanding. We introduce ChartT5, a V+L model that learns to interpret table information from chart images via cross-modal pre-training on plot-table pairs. Specifically, we propose two novel pre-training objectives, Masked Header Prediction (MHP) and Masked Value Prediction (MVP), to equip the model with complementary skills for interpreting table information. We conduct extensive experiments on chart question answering and chart summarization to verify the effectiveness of the proposed pre-training strategies. In particular, on the ChartQA benchmark, our ChartT5 outperforms state-of-the-art non-pretraining methods by over 8%.
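Both objectives follow the masked-modeling recipe: hide table headers or cell values and train the model to recover them from the chart. Below is a minimal, hedged sketch of such a masked-prediction head in PyTorch; the class name, tensor shapes, and the -100 label convention are illustrative assumptions, not ChartT5's actual implementation.

import torch
import torch.nn as nn

class MaskedValueHead(nn.Module):
    """Illustrative masked-prediction head: score vocabulary tokens at
    masked table-cell positions (hypothetical, not ChartT5's code)."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, vocab_size)
        # Positions labeled -100 (unmasked tokens) are ignored by the loss.
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, hidden_states: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden) encoder outputs over the
        # linearized table; labels: (batch, seq_len) original token ids at
        # masked header/value cells, -100 everywhere else.
        logits = self.classifier(hidden_states)
        return self.loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))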


Weakly-Supervised Temporal Article Grounding
Long Chen | Yulei Niu | Brian Chen | Xudong Lin | Guangxing Han | Christopher Thomas | Hammad Ayyubi | Heng Ji | Shih-Fu Chang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Given a long untrimmed video and natural language queries, video grounding (VG) aims to temporally localize the semantically-aligned video segments. Almost all existing VG work rests on two simple but unrealistic assumptions: 1) All query sentences can be grounded in the corresponding video. 2) All query sentences for the same video are always at the same semantic scale. Unfortunately, both assumptions make today’s VG models fail in practice. For example, in real-world multimodal assets (e.g., news articles), most of the sentences in an article cannot be grounded in their affiliated videos, and they typically have rich hierarchical relations (i.e., they sit at different semantic scales). To this end, we propose a new and challenging grounding task: Weakly-Supervised temporal Article Grounding (WSAG). Specifically, given an article and a relevant video, WSAG aims to localize all “groundable” sentences in the video, where these sentences may be at different semantic scales. Accordingly, we collect the first WSAG dataset, YouwikiHow, which pairs the inherently multi-scale descriptions of wikiHow articles with plentiful YouTube videos. In addition, we propose a simple but effective method for WSAG, DualMIL, which consists of a two-level MIL loss and a single-/cross-sentence constraint loss; these training objectives are carefully designed around the relaxed assumptions. Extensive ablations verify the effectiveness of DualMIL.
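At the core of DualMIL is multiple-instance learning: each sentence is a bag of candidate video segments, and the weak label only says whether some segment matches. A hedged sketch of one such bag-level loss in PyTorch, with illustrative names and shapes (not the authors' code):

import torch
import torch.nn.functional as F

def mil_loss(seg_scores: torch.Tensor, sent_labels: torch.Tensor) -> torch.Tensor:
    # seg_scores: (num_sentences, num_segments) raw alignment logits between
    # each article sentence (a "bag") and each video segment (an "instance").
    # sent_labels: (num_sentences,) weak 0/1 labels; 1 = sentence is groundable.
    sent_logits = seg_scores.max(dim=1).values  # instance-to-bag max pooling
    return F.binary_cross_entropy_with_logits(sent_logits, sent_labels.float())

scores = torch.randn(4, 10)           # 4 sentences, 10 candidate segments
labels = torch.tensor([1, 0, 1, 1])   # which sentences are groundable at all
loss = mil_loss(scores, labels)

Max pooling encodes the MIL assumption directly: a sentence counts as grounded as soon as its single best segment aligns, so no segment-level labels are ever needed.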


Joint Multimedia Event Extraction from Video and Article
Brian Chen | Xudong Lin | Christopher Thomas | Manling Li | Shoya Yoshida | Lovish Chum | Heng Ji | Shih-Fu Chang
Findings of the Association for Computational Linguistics: EMNLP 2021

Visual and textual modalities contribute complementary information about events described in multimedia documents. Videos contain rich dynamics and detailed unfoldings of events, while text describes more high-level and abstract concepts. However, existing event extraction methods either do not handle video or solely target video while ignoring other modalities. In contrast, we propose the first approach to jointly extract events from both videos and text articles. We introduce the new task of Video MultiMedia Event Extraction and propose two novel components to build the first system for this task. First, we propose the first self-supervised cross-modal event coreference model, which can determine coreference between video events and text events without any manually annotated pairs. Second, we introduce the first cross-modal transformer architecture, which extracts structured event information from both videos and text documents. We also construct, and will publicly release, a new benchmark consisting of 860 video-article pairs with extensive annotations for evaluating methods on this task. Our experimental results demonstrate the effectiveness of the proposed method on the new benchmark, with 6.0% and 5.8% absolute F-score gains on multimodal event coreference resolution and multimedia event extraction, respectively.
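Self-supervision for the coreference component can come for free from document structure: video events and text events drawn from the same video-article pair act as positives, all other pairs in a batch as negatives. A minimal sketch of a symmetric InfoNCE-style objective in that spirit, with assumed shapes and a hypothetical temperature; this is not the paper's architecture:

import torch
import torch.nn.functional as F

def contrastive_coref_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    # video_emb, text_emb: (batch, dim); row i of each is assumed to come from
    # the same video-article pair, which is the only supervision used.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                       # pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)   # matched pairs on the diagonal
    # Symmetric loss: video-to-text and text-to-video retrieval directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

At inference time the same learned similarity can be thresholded to decide whether a video event and a text event corefer.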

InfoSurgeon: Cross-Media Fine-grained Information Consistency Checking for Fake News Detection
Yi Fung | Christopher Thomas | Revanth Gangi Reddy | Sandeep Polisetty | Heng Ji | Shih-Fu Chang | Kathleen McKeown | Mohit Bansal | Avi Sil
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

To defend against machine-generated fake news, an effective detection mechanism is urgently needed. We contribute a novel benchmark for fake news detection at the knowledge-element level, as well as a solution for this task that incorporates cross-media consistency checking to detect the fine-grained knowledge elements that make news articles misinformative. Because training data are scarce, we also formulate a novel data synthesis method that manipulates knowledge elements within the knowledge graph to generate noisy training data with specific, hard-to-detect, known inconsistencies. Our detection approach outperforms the state of the art (up to 16.8% accuracy gain) and, more critically, yields fine-grained explanations.
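The synthesis idea is to edit a knowledge-graph element so the result stays fluent but is factually inconsistent in a precisely known way, giving a labeled training example. A toy sketch under that assumption; the triple layout and entity pool below are hypothetical, not the paper's pipeline:

import random

def corrupt_triple(triple, entity_pool):
    # triple: (subject, relation, (object_type, object_name)); entity_pool maps
    # an entity type to same-type candidates (both structures are hypothetical).
    subj, rel, (etype, name) = triple
    distractors = [e for e in entity_pool[etype] if e != name]
    fake = random.choice(distractors)
    # Return the corrupted triple plus a fine-grained record of what changed,
    # so a detector can be supervised at the knowledge-element level.
    return (subj, rel, (etype, fake)), {"element": "object", "from": name, "to": fake}

pool = {"PERSON": ["Alice Chen", "Bob Diaz", "Carol Singh"]}
clean = (("EVENT", "attack_01"), "Attacker", ("PERSON", "Alice Chen"))
noisy, label = corrupt_triple(clean, pool)

Swapping within the same entity type is what makes the inconsistency hard to detect: the corrupted article remains type-correct and grammatical, so only cross-media consistency checking can expose it.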


Samsung: Align-and-Differentiate Approach to Semantic Textual Similarity
Lushan Han | Justin Martineau | Doreen Cheng | Christopher Thomas
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)