Shan Jiang


2024

pdf bib
Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA
Yue Fan | Jing Gu | Kaiwen Zhou | Qianqi Yan | Shan Jiang | Ching-Chen Kuo | Yang Zhao | Xinze Guan | Xin Wang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multipanel images, commonly seen as web screenshots, posters, etc., pervade our daily lives. These images, characterized by their composition of multiple subfigures in distinct layouts, effectively convey information to people. Toward building advanced multimodal AI applications, such as agents that understand complex scenes and navigate through webpages, the skill of multipanel visual reasoning is essential, and a comprehensive evaluation of models in this regard is important. Therefore, we introduce Multipanel Visual Question Answering (MultipanelVQA), a novel benchmark comprising 6,600 triplets of questions, answers, and multipanel images that specifically challenge models in comprehending multipanel images. Our evaluation shows that questions in the MultipanelVQA benchmark pose significant challenges to the state-of-the-art Multimodal Large Language Models (MLLMs) tested, even though humans can attain approximately 99% accuracy on these questions. Distinctively, the MultipanelVQA benchmark features synthetically generated multipanel images specifically crafted to isolate and assess the impact of various factors, such as the layout, on MLLMs’ multipanel image comprehension abilities. As a result, in addition to benchmarking the capabilities of MLLMs in understanding multipanel images, we analyze various factors of the multipanel image that affect MLLMs’ performance with synthetic data and offer insights for enhancement.

2021

pdf bib
Structurizing Misinformation Stories via Rationalizing Fact-Checks
Shan Jiang | Christo Wilson
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Misinformation has recently become a well-documented matter of public concern. Existing studies on this topic have hitherto adopted a coarse concept of misinformation, which incorporates a broad spectrum of story types ranging from political conspiracies to misinterpreted pranks. This paper aims to structurize these misinformation stories by leveraging fact-check articles. Our intuition is that key phrases in a fact-check article that identify the misinformation type(s) (e.g., doctored images, urban legends) also act as rationales that determine the verdict of the fact-check (e.g., false). We experiment on rationalized models with domain knowledge as weak supervision to extract these phrases as rationales, and then cluster semantically similar rationales to summarize prevalent misinformation types. Using archived fact-checks from Snopes.com, we identify ten types of misinformation stories. We discuss how these types have evolved over the last ten years and compare their prevalence between the 2016/2020 US presidential elections and the H1N1/COVID-19 pandemics.

2018

pdf bib
DEMN: Distilled-Exposition Enhanced Matching Network for Story Comprehension
Chunhua Liu | Haiou Zhang | Shan Jiang | Dong Yu
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation