SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling

Eileen Wang; Caren Han; Josiah Poon

doi:10.18653/v1/2024.eacl-long.96


Abstract

Visual storytelling aims to automatically generate a coherent story based on a given image sequence. Unlike tasks such as image captioning, visual storytelling should combine factual descriptions, worldviews, and human social commonsense so that disjointed elements form a coherent and engaging story of the kind a human would write. However, most models focus on factual information and on taxonomic or lexical external knowledge when generating stories. This paper introduces SCO-VIST, a framework that represents the image sequence as a graph of objects and relations, including human action motivation and its social interaction commonsense knowledge. SCO-VIST then treats this graph as a set of plot points and creates bridges between the plot points with semantic and occurrence-based edge weights. The resulting weighted story graph yields the storyline as a sequence of events using the Floyd-Warshall algorithm. In both automatic and human evaluations, the proposed framework produces stories that are superior across multiple metrics in terms of visual grounding, coherence, diversity, and humanness.
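The abstract names the Floyd-Warshall algorithm but does not say how the storyline is read off its output. As a rough illustration only, the sketch below runs the standard algorithm on a small weighted directed graph and recovers the lowest-cost route between a first and a last plot point. The node indices, the edge weights, and the choice to treat the storyline as a shortest path are all assumptions made for this example; they are not taken from the paper.

```python
import math


def floyd_warshall(n, edges):
    """All-pairs shortest paths on n nodes.

    edges maps (u, v) -> weight for a directed edge u -> v.
    Returns (dist, nxt), where nxt[i][j] is the node that follows i
    on a shortest path from i to j (None if j is unreachable).
    """
    dist = [[math.inf] * n for _ in range(n)]
    nxt = [[None] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        nxt[i][i] = i
    for (u, v), w in edges.items():
        dist[u][v] = w
        nxt[u][v] = v
    # Allow each node k in turn to act as an intermediate stop.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    return dist, nxt


def reconstruct_path(nxt, src, dst):
    """Follow the next-hop table from src to dst; [] if unreachable."""
    if nxt[src][dst] is None:
        return []
    path = [src]
    while src != dst:
        src = nxt[src][dst]
        path.append(src)
    return path


# Hypothetical plot points 0..4 with made-up edge weights
# (lower weight = a more plausible transition between events).
edges = {
    (0, 1): 0.4, (0, 2): 0.9,
    (1, 2): 0.3, (1, 3): 0.8,
    (2, 3): 0.2, (2, 4): 1.0,
    (3, 4): 0.5,
}
dist, nxt = floyd_warshall(5, edges)
storyline = reconstruct_path(nxt, 0, 4)
print(storyline)
```

In this toy graph the cheapest route from plot point 0 to plot point 4 passes through every intermediate node, because each short hop is cheaper than the direct alternatives. How SCO-VIST defines its semantic and occurrence-based weights, and how it selects start and end events, is described in the paper itself.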

Anthology ID: 2024.eacl-long.96
Volume: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: March
Year: 2024
Address: St. Julian’s, Malta
Editors: Yvette Graham, Matthew Purver
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 1602–1616
URL: https://aclanthology.org/2024.eacl-long.96/
DOI: 10.18653/v1/2024.eacl-long.96
Cite (ACL): Eileen Wang, Caren Han, and Josiah Poon. 2024. SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1602–1616, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal): SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling (Wang et al., EACL 2024)
PDF: https://aclanthology.org/2024.eacl-long.96.pdf
Video: https://aclanthology.org/2024.eacl-long.96.mp4
