StoryLLaVA: Enhancing Visual Storytelling with Multi-Modal Large Language Models

Li Yang, Zhiding Xiao, Wenxin Huang, Xian Zhong


Abstract
The rapid development of multimodal large language models (MLLMs) has positioned visual storytelling as a crucial area in content creation. However, existing models often struggle to maintain temporal, spatial, and narrative coherence across image sequences, and they frequently lack the depth and engagement of human-authored stories. To address these challenges, we propose Story with Large Language-and-Vision Alignment (StoryLLaVA), a novel framework for enhancing visual storytelling. Our approach introduces a topic-driven narrative optimizer that improves both the training data and the underlying MLLM by integrating image descriptions, topic generation, and GPT-4-based refinements. Furthermore, we employ a preference-based ranked story sampling method that aligns model outputs with human storytelling preferences through positive-negative pairing. The two phases of the framework differ in their training methods: the former uses supervised fine-tuning, while the latter applies reinforcement learning over positive and negative sample pairs. Experimental results demonstrate that StoryLLaVA outperforms current models in visual relevance, coherence, and fluency, with LLM-based evaluations confirming that it generates richer and more engaging narratives. The enhanced dataset and model will be made publicly available soon.
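The abstract states that the second training phase aligns model outputs with human storytelling preferences via positive-negative story pairs, but it does not specify the exact objective. Below is a minimal, illustrative sketch assuming a DPO-style pairwise preference loss; the function name, hyperparameter value, and log-likelihood numbers are hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def preference_pair_loss(policy_logp_pos, policy_logp_neg,
                         ref_logp_pos, ref_logp_neg, beta=0.1):
    """DPO-style loss over positive/negative story pairs (illustrative only).

    Each argument is the summed log-likelihood of a full story under the
    trainable policy (policy_logp_*) or a frozen reference model (ref_logp_*).
    """
    # Log-ratio of policy vs. reference for preferred and dispreferred stories
    pos_ratio = policy_logp_pos - ref_logp_pos
    neg_ratio = policy_logp_neg - ref_logp_neg
    # Widen the margin between positive and negative samples relative to the reference
    margin = beta * (pos_ratio - neg_ratio)
    return -F.logsigmoid(margin).mean()

# Toy usage with made-up log-likelihoods for a batch of two story pairs
loss = preference_pair_loss(
    torch.tensor([-120.0, -98.0]),   # policy log p(positive story)
    torch.tensor([-135.0, -110.0]),  # policy log p(negative story)
    torch.tensor([-125.0, -100.0]),  # reference log p(positive story)
    torch.tensor([-130.0, -108.0]),  # reference log p(negative story)
)
print(loss)  # smaller when the policy prefers positive stories more strongly than the reference
```

The loss decreases as the policy assigns relatively higher likelihood to the human-preferred (positive) story than to the negative one, which is one common way to realize pairwise preference alignment without an explicit reward model.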
Anthology ID: 2025.coling-main.266
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 3936–3951
URL: https://aclanthology.org/2025.coling-main.266/
Cite (ACL): Li Yang, Zhiding Xiao, Wenxin Huang, and Xian Zhong. 2025. StoryLLaVA: Enhancing Visual Storytelling with Multi-Modal Large Language Models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 3936–3951, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): StoryLLaVA: Enhancing Visual Storytelling with Multi-Modal Large Language Models (Yang et al., COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.266.pdf