Improving Generation and Evaluation of Visual Stories via Semantic Consistency

Adyasha Maharana, Darryl Hannan, Mohit Bansal


Abstract
Story visualization is an underexplored task that falls at the intersection of many important research directions in both computer vision and natural language processing. In this task, given a series of natural language captions which compose a story, an agent must generate a sequence of images that correspond to the captions. Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task. However, there is room for improvement of generated images in terms of visual quality, coherence and relevance. We present a number of improvements to prior modeling approaches, including (1) the addition of a dual learning framework that utilizes video captioning to reinforce the semantic alignment between the story and generated images, (2) a copy-transform mechanism for sequentially-consistent story visualization, and (3) MART-based transformers to model complex interactions between frames. We present ablation studies to demonstrate the effect of each of these techniques on the generative power of the model for both individual images and the entire narrative. Furthermore, due to the complexity and generative nature of the task, standard evaluation metrics do not accurately reflect performance. Therefore, we also provide an exploration of evaluation metrics for the model, focused on aspects of the generated frames such as the presence/quality of generated characters, the relevance to captions, and the diversity of the generated images. We also present correlation experiments of our proposed automated metrics with human evaluations.
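
The dual learning idea described in the abstract can be illustrated with a minimal sketch: a story-to-image generator's loss is augmented by a video-captioning model that must reconstruct the input captions from the generated frames. The module names, shapes, and weighting factor below are illustrative assumptions, not the authors' actual implementation (see the linked code repository for that).

```python
# Illustrative sketch of a dual-learning objective: penalize the generator when a
# captioning model cannot recover the story text from the generated frames.
# All names and hyperparameters here are assumptions for exposition only.
import torch.nn as nn

class DualLearningLoss(nn.Module):
    def __init__(self, captioner, lambda_dual=1.0):
        super().__init__()
        self.captioner = captioner            # maps generated frames -> caption token logits
        self.ce = nn.CrossEntropyLoss()
        self.lambda_dual = lambda_dual        # weight on the captioning (dual) term

    def forward(self, gen_loss, generated_frames, caption_tokens):
        # Caption the generated frames and score them against the source captions,
        # reinforcing semantic alignment between the story text and the images.
        logits = self.captioner(generated_frames)            # (B, T, vocab_size)
        dual_loss = self.ce(logits.flatten(0, 1),             # (B*T, vocab_size)
                            caption_tokens.flatten())         # (B*T,)
        return gen_loss + self.lambda_dual * dual_loss
```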
Anthology ID:
2021.naacl-main.194
Volume:
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
June
Year:
2021
Address:
Online
Editors:
Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, Yichao Zhou
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
2427–2442
URL:
https://aclanthology.org/2021.naacl-main.194
DOI:
10.18653/v1/2021.naacl-main.194
Cite (ACL):
Adyasha Maharana, Darryl Hannan, and Mohit Bansal. 2021. Improving Generation and Evaluation of Visual Stories via Semantic Consistency. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2427–2442, Online. Association for Computational Linguistics.
Cite (Informal):
Improving Generation and Evaluation of Visual Stories via Semantic Consistency (Maharana et al., NAACL 2021)
PDF:
https://aclanthology.org/2021.naacl-main.194.pdf
Video:
https://aclanthology.org/2021.naacl-main.194.mp4
Code:
adymaharana/StoryViz