CoSIm: Commonsense Reasoning for Counterfactual Scene Imagination

Hyounghun Kim, Abhay Zala, Mohit Bansal


Abstract
As humans, we can modify our assumptions about a scene by imagining alternative objects or concepts in our minds. For example, we can easily anticipate the implications of the sun being overcast by rain clouds (e.g., the street will get wet) and prepare accordingly. In this paper, we introduce a new dataset called Commonsense Reasoning for Counterfactual Scene Imagination (CoSIm), designed to evaluate the ability of AI systems to reason about scene change imagination. Specifically, in this multimodal task/dataset, models are given an image and an initial question-response pair about the image. Next, a counterfactual imagined scene change (in textual form) is applied, and the model has to predict the new response to the initial question based on this scene change. We collect 3.5K high-quality and challenging data instances, with each instance consisting of an image, a commonsense question with a response, a description of a counterfactual change, a new response to the question, and three distractor responses. Our dataset contains various complex scene change types (such as object addition/removal/state change, event description, and environment change) that require models to imagine many different scenarios and reason about the changed scenes. We present a baseline model based on a vision-language Transformer (i.e., LXMERT) and ablation studies. Through human evaluation, we demonstrate a large human-model performance gap, suggesting room for promising future work on this challenging, counterfactual multimodal task.
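
To make the task format concrete, the following is a minimal Python sketch of what a CoSIm-style instance and a multiple-choice prediction loop might look like. The field names (image_path, question, initial_response, change, candidates, label) and the score_candidate stub are illustrative assumptions, not the dataset's actual schema or the authors' LXMERT baseline.

# Hypothetical sketch of a CoSIm-style instance and candidate scoring.
# Field names and the scoring stub are assumptions for illustration only.
from typing import Dict, List
import random

def score_candidate(image_path: str, question: str, initial_response: str,
                    change: str, candidate: str) -> float:
    """Placeholder for a vision-language model (e.g., an LXMERT-style encoder)
    that scores how plausible `candidate` is as the new response once the
    counterfactual `change` is applied to the scene in the image."""
    return random.random()  # stand-in for a real model score

def predict(instance: Dict) -> int:
    """Return the index of the highest-scoring candidate response."""
    scores: List[float] = [
        score_candidate(instance["image_path"], instance["question"],
                        instance["initial_response"], instance["change"], cand)
        for cand in instance["candidates"]
    ]
    return max(range(len(scores)), key=scores.__getitem__)

if __name__ == "__main__":
    example = {  # hypothetical instance mirroring the task description above
        "image_path": "street_scene.jpg",
        "question": "Will the street stay dry this afternoon?",
        "initial_response": "Yes, it is sunny with no clouds in sight.",
        "change": "The sun is now overcast by heavy rain clouds.",
        "candidates": [
            "No, the street will likely get wet once it starts raining.",
            "Yes, nothing about the weather has changed.",
            "Yes, the street is indoors.",
            "No, the street is already flooded by the ocean.",
        ],
        "label": 0,
    }
    pred = predict(example)
    print(f"predicted candidate {pred}, gold candidate {example['label']}")

The model is given the original question and response together with the textual change description, and must pick the correct new response from four candidates (one correct answer plus three distractors), as described in the abstract.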
Anthology ID:
2022.naacl-main.66
Volume:
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
July
Year:
2022
Address:
Seattle, United States
Editors:
Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
911–923
URL:
https://aclanthology.org/2022.naacl-main.66
DOI:
10.18653/v1/2022.naacl-main.66
Cite (ACL):
Hyounghun Kim, Abhay Zala, and Mohit Bansal. 2022. CoSIm: Commonsense Reasoning for Counterfactual Scene Imagination. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 911–923, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
CoSIm: Commonsense Reasoning for Counterfactual Scene Imagination (Kim et al., NAACL 2022)
PDF:
https://aclanthology.org/2022.naacl-main.66.pdf
Software:
 2022.naacl-main.66.software.zip
Video:
 https://aclanthology.org/2022.naacl-main.66.mp4
Code:
hyounghk/cosim
Data:
VCR, Visual Genome, Visual Question Answering