Khushboo Mehra


2023

pdf bib
Visually Grounded Story Generation Challenge
Xudong Hong | Khushboo Mehra | Asad Sayeed | Vera Demberg
Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges

Recent large pre-trained models have achieved strong performance in multimodal language generation, which requires a joint effort of vision and language modeling. However, most previous generation tasks are based on single image input and produce short text descriptions that are not grounded on the input images. In this work, we propose a shared task on visually grounded story generation. The input is an image sequence, and the output is a story that is conditioned on the input images. This task is particularly challenging because: 1) the protagonists in the generated stories need to be grounded in the images and 2) the output story should be a coherent long-form text. We aim to advance the study of vision-based story generation by accepting submissions that propose new methods as well as new evaluation measures.

pdf bib
Visual Writing Prompts: Character-Grounded Story Generation with Curated Image Sequences
Xudong Hong | Asad Sayeed | Khushboo Mehra | Vera Demberg | Bernt Schiele
Transactions of the Association for Computational Linguistics, Volume 11

Current work on image-based story generation suffers from the fact that the existing image sequence collections do not have coherent plots behind them. We improve visual story generation by producing a new image-grounded dataset, Visual Writing Prompts (VWP). VWP contains almost 2K selected sequences of movie shots, each including 5-10 images. The image sequences are aligned with a total of 12K stories which were collected via crowdsourcing given the image sequences and a set of grounded characters from the corresponding image sequence. Our new image sequence collection and filtering process has allowed us to obtain stories that are more coherent, diverse, and visually grounded compared to previous work. We also propose a character-based story generation model driven by coherence as a strong baseline. Evaluations show that our generated stories are more coherent, visually grounded, and diverse than stories generated with the current state-of-the-art model. Our code, image features, annotations and collected stories are available at https://vwprompt.github.io/.

2020

pdf bib
Diverse and Relevant Visual Storytelling with Scene Graph Embeddings
Xudong Hong | Rakshith Shetty | Asad Sayeed | Khushboo Mehra | Vera Demberg | Bernt Schiele
Proceedings of the 24th Conference on Computational Natural Language Learning

A problem in automatically generated stories for image sequences is that they use overly generic vocabulary and phrase structure and fail to match the distributional characteristics of human-generated text. We address this problem by introducing explicit representations for objects and their relations by extracting scene graphs from the images. Utilizing an embedding of this scene graph enables our model to more explicitly reason over objects and their relations during story generation, compared to the global features from an object classifier used in previous work. We apply metrics that account for the diversity of words and phrases of generated stories as well as for reference to narratively-salient image features and show that our approach outperforms previous systems. Our experiments also indicate that our models obtain competitive results on reference-based metrics.