Xian Zhong
2025
StoryLLaVA: Enhancing Visual Storytelling with Multi-Modal Large Language Models
Li Yang | Zhiding Xiao | Wenxin Huang | Xian Zhong
Proceedings of the 31st International Conference on Computational Linguistics
The rapid development of multimodal large language models (MLLMs) has positioned visual storytelling as a crucial area in content creation. However, existing models often struggle to maintain temporal, spatial, and narrative coherence across image sequences, and they frequently lack the depth and engagement of human-authored stories. To address these challenges, we propose Story with Large Language-and-Vision Alignment (StoryLLaVA), a novel framework for enhancing visual storytelling. Our approach introduces a topic-driven narrative optimizer that refines both the training data and the MLLM by integrating image descriptions, topic generation, and GPT-4-based refinements. Furthermore, we employ a preference-based ranked story sampling method that aligns model outputs with human storytelling preferences through positive-negative pairing. These two phases of the framework differ in their training methods: the former uses supervised fine-tuning, while the latter incorporates reinforcement learning with positive and negative sample pairs. Experimental results demonstrate that StoryLLaVA outperforms current models in visual relevance, coherence, and fluency, with LLM-based evaluations confirming the generation of richer and more engaging narratives. The enhanced dataset and model will be made publicly available soon.
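The second phase described above trains on positive-negative story pairs. As a minimal sketch of what such a pairwise preference objective can look like (a DPO-style loss; all function and variable names here are illustrative assumptions, not the authors' released code):

```python
# Sketch of preference-based training with positive-negative story pairs,
# in the style of Direct Preference Optimization (DPO). Illustrative only;
# the paper's actual objective and implementation may differ.
import torch
import torch.nn.functional as F

def preference_pair_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Push the policy to rank the human-preferred (positive) story above
    the negative one, relative to a frozen reference model.

    Each argument is the summed log-probability of a full story under the
    policy or reference model, shape (batch,).
    """
    # Log-ratio of policy vs. reference for each story in the pair.
    pos_ratio = logp_pos - ref_logp_pos
    neg_ratio = logp_neg - ref_logp_neg
    # A larger margin means the positive story is preferred more strongly.
    return -F.logsigmoid(beta * (pos_ratio - neg_ratio)).mean()

# Toy usage with random log-probabilities for a batch of 4 story pairs.
torch.manual_seed(0)
loss = preference_pair_loss(torch.randn(4), torch.randn(4),
                            torch.randn(4), torch.randn(4))
print(loss.item())
```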
Agent Trading Arena: A Study on Numerical Understanding in LLM-Based Agents
Tianmi Ma | Jiawei Du | Wenxin Huang | Wenjie Wang | Liang Xie | Xian Zhong | Joey Tianyi Zhou
Findings of the Association for Computational Linguistics: EMNLP 2025
Large language models (LLMs) have demonstrated remarkable capabilities in natural language tasks, yet their performance in dynamic, real-world financial environments remains underexplored. Existing approaches are confined to historical backtesting, where trading actions cannot influence market prices and agents train on static data. To overcome this limitation, we present the Agent Trading Arena, a virtual zero-sum stock market in which LLM-based agents engage in competitive, multi-agent trading and directly impact price dynamics. By simulating realistic bid-ask interactions, our platform enables agents to train in scenarios that closely mirror live markets, thereby narrowing the gap between training and evaluation. Experiments show that LLMs struggle with numerical reasoning when given plain-text data, tending to overfit local patterns and recent values. In contrast, chart-based visualizations significantly boost both numerical reasoning and trading performance. Moreover, integrating a reflection module yields further improvements, especially with visual inputs. Finally, evaluations on the NASDAQ and CSI datasets demonstrate the superiority of our method, particularly under high volatility. All code and data are available at https://github.com/wekjsdvnm/Agent-Trading-Arena.
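As a toy illustration of the bid-ask matching through which agents' trades can move prices in such a zero-sum arena (an assumed simplification; the actual mechanics live in the repository linked above):

```python
# Toy zero-sum double-auction step: cross the highest bid against the lowest
# ask so that agents' orders, not a static price series, set the clearing
# price. Purely illustrative of the idea, not the Agent Trading Arena code.
from dataclasses import dataclass

@dataclass
class Order:
    agent: str
    side: str    # "bid" (buy) or "ask" (sell)
    price: float
    qty: int

def match_orders(orders):
    """Cross bids and asks; return executed trades and the last trade price."""
    bids = sorted((o for o in orders if o.side == "bid"), key=lambda o: -o.price)
    asks = sorted((o for o in orders if o.side == "ask"), key=lambda o: o.price)
    trades, last_price = [], None
    while bids and asks and bids[0].price >= asks[0].price:
        bid, ask = bids[0], asks[0]
        qty = min(bid.qty, ask.qty)
        last_price = (bid.price + ask.price) / 2  # midpoint execution
        trades.append((bid.agent, ask.agent, qty, last_price))
        bid.qty -= qty
        ask.qty -= qty
        if bid.qty == 0:
            bids.pop(0)
        if ask.qty == 0:
            asks.pop(0)
    return trades, last_price

trades, price = match_orders([
    Order("llm_a", "bid", 101.0, 10),
    Order("llm_b", "ask", 100.0, 6),
    Order("llm_c", "ask", 100.5, 10),
])
print(trades, price)  # two fills; the last fill sets the new market price
```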