Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation

Jeff Da, Maxwell Forbes, Rowan Zellers, Anthony Zheng, Jena D. Hwang, Antoine Bosselut, Yejin Choi


Abstract
Understanding manipulated media, from automatically generated ‘deepfakes’ to manually edited ones, raises novel research challenges. Because the vast majority of edited or manipulated images are benign, such as photoshopped images for visual enhancements, the key challenge is to understand the complex layers of underlying intents of media edits and their implications with respect to disinformation. In this paper, we study Edited Media Frames, a new formalism to understand visual media manipulation as structured annotations with respect to the intents, emotional reactions, attacks on individuals, and the overall implications of disinformation. We introduce a dataset for our task, EMU, with 56k question-answer pairs written in rich natural language. We evaluate a wide variety of vision-and-language models for our task, and introduce a new model PELICAN, which builds upon recent progress in pretrained multimodal representations. Our model obtains promising results on our dataset, with humans rating its answers as accurate 48.2% of the time. At the same time, there is still much work to be done – and we provide analysis that highlights areas for further progress.
Anthology ID:
2021.acl-long.158
Volume:
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Month:
August
Year:
2021
Address:
Online
Venues:
ACL | IJCNLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
2026–2039
Language:
URL:
https://aclanthology.org/2021.acl-long.158
DOI:
10.18653/v1/2021.acl-long.158
Bibkey:
Cite (ACL):
Jeff Da, Maxwell Forbes, Rowan Zellers, Anthony Zheng, Jena D. Hwang, Antoine Bosselut, and Yejin Choi. 2021. Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2026–2039, Online. Association for Computational Linguistics.
Cite (Informal):
Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation (Da et al., ACL 2021)
Copy Citation:
PDF:
https://aclanthology.org/2021.acl-long.158.pdf
Video:
 https://aclanthology.org/2021.acl-long.158.mp4
Data
Conceptual Captions