Xudong Hong


Visual Coherence Loss for Coherent and Visually Grounded Story Generation
Xudong Hong | Vera Demberg | Asad Sayeed | Qiankun Zheng | Bernt Schiele
Findings of the Association for Computational Linguistics: ACL 2023

Local coherence is essential for long-form text generation models. We identify two important aspects of local coherence within the visual storytelling task: (1) the model needs to represent re-occurrences of characters within the image sequence in order to mention them correctly in the story; (2) character representations should enable us to find instances of the same characters and distinguish different characters. In this paper, we propose a loss function inspired by a linguistic theory of coherence for self-supervised learning of image sequence representations. We further propose combining features from an object detector and a face detector to construct stronger character features. To evaluate the input-output relevance that current reference-based metrics do not measure, we propose a character matching metric that checks whether the models generate referring expressions correctly for characters in the input image sequences. Experiments on a visual story generation dataset show that our proposed features and loss function are effective for generating more coherent and visually grounded stories.
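The listing does not spell out the proposed loss, so the following is only a minimal PyTorch sketch of one plausible self-supervised formulation: a contrastive loss that pulls together embeddings of the same character across the image sequence and pushes apart different characters. The function name, the InfoNCE-style form, and the temperature are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def character_coherence_loss(char_embs, char_ids, temperature=0.1):
    """Contrastive loss over character instances in an image sequence
    (hypothetical sketch): same-character embeddings are pulled together,
    different characters are pushed apart.

    char_embs: (N, D) one embedding per detected character instance
    char_ids:  (N,)   identity label (same id = same character)
    """
    z = F.normalize(char_embs, dim=-1)
    sim = z @ z.t() / temperature                        # (N, N) scaled cosine sims
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (char_ids.unsqueeze(0) == char_ids.unsqueeze(1)) & ~self_mask

    sim = sim.masked_fill(self_mask, float("-inf"))      # exclude self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)  # keep positives only

    # Skip anchors whose character appears only once (no positives).
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0
    return (-pos_log_prob.sum(dim=1)[valid] / pos_counts[valid]).mean()
```

The character embeddings fed to such a loss could be the concatenated object-detector and face-detector features described in the abstract.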

Visually Grounded Story Generation Challenge
Xudong Hong | Khushboo Mehra | Asad Sayeed | Vera Demberg
Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges

Recent large pre-trained models have achieved strong performance in multimodal language generation, which requires a joint effort of vision and language modeling. However, most previous generation tasks are based on a single input image and produce short text descriptions that are not grounded in the input images. In this work, we propose a shared task on visually grounded story generation. The input is an image sequence, and the output is a story conditioned on the input images. This task is particularly challenging because (1) the protagonists in the generated stories need to be grounded in the images and (2) the output story should be a coherent long-form text. We aim to advance the study of vision-based story generation by accepting submissions that propose new methods as well as new evaluation measures.

Visual Writing Prompts: Character-Grounded Story Generation with Curated Image Sequences
Xudong Hong | Asad Sayeed | Khushboo Mehra | Vera Demberg | Bernt Schiele
Transactions of the Association for Computational Linguistics, Volume 11

Current work on image-based story generation suffers from the fact that existing image sequence collections do not have coherent plots behind them. We improve visual story generation by producing a new image-grounded dataset, Visual Writing Prompts (VWP). VWP contains almost 2K selected sequences of movie shots, each including 5-10 images. The image sequences are aligned with a total of 12K stories, which were collected via crowdsourcing given the image sequences and a set of grounded characters from the corresponding image sequence. Our new image sequence collection and filtering process has allowed us to obtain stories that are more coherent, diverse, and visually grounded than in previous work. We also propose a character-based story generation model driven by coherence as a strong baseline. Evaluations show that our generated stories are more coherent, visually grounded, and diverse than stories generated with the current state-of-the-art model. Our code, image features, annotations and collected stories are available at https://vwprompt.github.io/.

A surprisal oracle for active curriculum language modeling
Xudong Hong | Sharid Loáiciga | Asad Sayeed
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning

Visual Coherence Loss for Coherent and Visually Grounded Story Generation
Xudong Hong | Vera Demberg | Asad Sayeed | Qiankun Zheng | Bernt Schiele
Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)


Two-Stage Movie Script Summarization: An Efficient Method For Low-Resource Long Document Summarization
Dongqi Pu | Xudong Hong | Pin-Jie Lin | Ernie Chang | Vera Demberg
Proceedings of The Workshop on Automatic Summarization for Creative Writing

The Creative Summarization Shared Task at COLING 2022 aims to generate summaries of long-form texts from creative writing. This paper presents the system architecture and the results of our participation in the Scriptbase track, which focuses on generating movie plots from movie scripts. The core innovation in our model is a two-stage hierarchical architecture for movie script summarization. In the first stage, a heuristic extraction method extracts actions and essential dialogues, which reduces the average length of input movie scripts by 66%, from about 24K to 8K tokens. In the second stage, a state-of-the-art encoder-decoder model, the Longformer-Encoder-Decoder (LED), is trained with two effective fine-tuning methods, BitFit and NoisyTune. Evaluations on the unseen test set indicate that our system outperforms both zero-shot LED baselines and the other participants' systems on various automatic metrics, ranking first in the Scriptbase track.
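As a rough sketch of the second-stage fine-tuning recipe named above, the snippet below applies NoisyTune (perturbing pretrained weight matrices with small uniform noise scaled by each matrix's standard deviation) and BitFit (updating only bias terms) to an LED model from Hugging Face transformers. The checkpoint name and noise scale are assumptions for illustration; the paper's exact configuration is not given here.

```python
import torch
from transformers import AutoTokenizer, LEDForConditionalGeneration

# Assumed checkpoint; the paper's exact LED variant is not stated here.
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")
tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")

# NoisyTune: add small uniform noise to pretrained weight matrices before
# fine-tuning; noise_lambda is an illustrative value.
noise_lambda = 0.15
with torch.no_grad():
    for param in model.parameters():
        if param.dim() > 1:  # perturb weight matrices only
            param.add_((torch.rand_like(param) - 0.5) * noise_lambda * param.std())

# BitFit: train only the bias terms and freeze everything else.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("bias")
```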


Diverse and Relevant Visual Storytelling with Scene Graph Embeddings
Xudong Hong | Rakshith Shetty | Asad Sayeed | Khushboo Mehra | Vera Demberg | Bernt Schiele
Proceedings of the 24th Conference on Computational Natural Language Learning

A problem with automatically generated stories for image sequences is that they use overly generic vocabulary and phrase structure and fail to match the distributional characteristics of human-generated text. We address this problem by introducing explicit representations for objects and their relations, extracted from the images as scene graphs. Utilizing an embedding of this scene graph enables our model to reason more explicitly over objects and their relations during story generation, compared to the global features from an object classifier used in previous work. We apply metrics that account for the diversity of words and phrases in generated stories as well as for reference to narratively salient image features, and show that our approach outperforms previous systems. Our experiments also indicate that our models obtain competitive results on reference-based metrics.
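As a hedged illustration of how a scene-graph embedding might be computed, the sketch below runs one round of relation-aware message passing over detected objects and pools the result into a single graph vector; all class and parameter names are hypothetical, and the paper's actual encoder may differ.

```python
import torch
import torch.nn as nn

class SceneGraphEncoder(nn.Module):
    """Hypothetical sketch: one round of relation-aware message passing
    over a scene graph; the pooled vector would condition the story
    decoder (not shown)."""

    def __init__(self, obj_dim, num_relations, hidden):
        super().__init__()
        self.rel_emb = nn.Embedding(num_relations, hidden)
        self.node_proj = nn.Linear(obj_dim, hidden)
        self.msg = nn.Linear(2 * hidden, hidden)

    def forward(self, obj_feats, edges, rels):
        # obj_feats: (N, obj_dim) detector features; edges: (E, 2) [src, dst];
        # rels: (E,) relation ids (e.g. "holding", "next to").
        h = torch.relu(self.node_proj(obj_feats))            # (N, hidden)
        src, dst = edges[:, 0], edges[:, 1]
        m = torch.relu(self.msg(torch.cat([h[src], self.rel_emb(rels)], dim=-1)))
        h = h + torch.zeros_like(h).index_add_(0, dst, m)    # sum incoming messages
        return h.mean(dim=0)                                  # pooled graph embedding
```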


Improving Language Generation from Feature-Rich Tree-Structured Data with Relational Graph Convolutional Encoders
Xudong Hong | Ernie Chang | Vera Demberg
Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019)

The Multilingual Surface Realization Shared Task 2019 focuses on generating sentences from lemmatized sets of universal dependency parses with rich features. This paper describes the results of our participation in the deep track. The core innovation in our approach is to use a graph convolutional network to encode the dependency trees given as input. Upon adding morphological features, our system achieves the third rank without using data augmentation techniques or additional components (such as a re-ranker).
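For illustration, a minimal relational graph convolutional layer in the style of Schlichtkrull et al. (2018), with one weight matrix per dependency relation plus a self-loop transform, could look as follows; this is a generic sketch, not the system's exact encoder.

```python
import torch
import torch.nn as nn

class RGCNLayer(nn.Module):
    """Minimal relational GCN layer (illustrative): messages along each
    dependency edge are transformed by a relation-specific matrix."""

    def __init__(self, in_dim, out_dim, num_rels):
        super().__init__()
        self.rel_weight = nn.Parameter(torch.empty(num_rels, in_dim, out_dim))
        self.self_loop = nn.Linear(in_dim, out_dim)
        nn.init.xavier_uniform_(self.rel_weight)

    def forward(self, h, edges, rels):
        # h: (N, in_dim) node states; edges: (E, 2) [head, dependent];
        # rels: (E,) dependency relation ids (nsubj, obj, ...).
        src, dst = edges[:, 0], edges[:, 1]
        msgs = torch.bmm(h[src].unsqueeze(1), self.rel_weight[rels]).squeeze(1)
        return torch.relu(self.self_loop(h).index_add(0, dst, msgs))
```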


Learning distributed event representations with a multi-task approach
Xudong Hong | Asad Sayeed | Vera Demberg
Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics

Human world knowledge contains information about prototypical events and their participants and locations. In this paper, we train the first models using multi-task learning that can both predict missing event participants and perform semantic role classification based on semantic plausibility. Our best-performing model improves over the previous state of the art on thematic fit modelling tasks. The event embeddings learned by the model can additionally be used effectively in an event similarity task, also outperforming the state of the art.
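A minimal sketch of such a multi-task setup, assuming a shared event encoder feeding two heads (participant prediction and semantic role classification), is shown below; the names and architecture details are illustrative, not the paper's exact model.

```python
import torch
import torch.nn as nn

class MultiTaskEventModel(nn.Module):
    """Illustrative sketch: a shared encoder over observed role-word pairs,
    with one head predicting a missing participant and another classifying
    the semantic role of a candidate word."""

    def __init__(self, vocab_size, num_roles, dim=300):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.role_emb = nn.Embedding(num_roles, dim)
        self.encoder = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
        self.participant_head = nn.Linear(dim, vocab_size)  # which word fills the slot?
        self.role_head = nn.Linear(2 * dim, num_roles)      # which role does it fill?

    def forward(self, words, roles, target_word):
        # words, roles: (B, L) ids of the observed role-word pairs of an event;
        # target_word: (B,) candidate filler whose role is to be classified.
        ctx = self.encoder((self.word_emb(words) + self.role_emb(roles)).mean(dim=1))
        participant_logits = self.participant_head(ctx)
        role_logits = self.role_head(torch.cat([ctx, self.word_emb(target_word)], dim=-1))
        return participant_logits, role_logits
```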


Roleo: Visualising Thematic Fit Spaces on the Web
Asad Sayeed | Xudong Hong | Vera Demberg
Proceedings of ACL-2016 System Demonstrations