2024
Summary of the Visually Grounded Story Generation Challenge
Xudong Hong | Asad Sayeed | Vera Demberg
Proceedings of the 17th International Natural Language Generation Conference: Generation Challenges
Recent advancements in vision-and-language models have opened new possibilities for natural language generation, particularly in generating creative stories from visual input. We thus host an open-sourced shared task, Visually Grounded Story Generation (VGSG), to explore whether these models can create coherent, diverse, and visually grounded narratives. This task challenges participants to generate coherent stories based on sequences of images, where characters and events must be grounded in the images provided. The task is structured into two tracks: the Closed track, with constraints on fixed visual features, and the Open track, which allows all kinds of models. We propose the first two-stage model using GPT-4o as the baseline for the Open track, which first generates descriptions for the images and then creates a story based on those descriptions. Human and automatic evaluations indicate that: 1) retrieval augmentation helps generate more human-like stories; 2) large-scale pre-trained LLMs improve story quality by a large margin; and 3) traditional automatic metrics cannot capture the overall quality.
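To make the two-stage baseline described above concrete, here is a minimal sketch of a describe-then-narrate pipeline. It assumes the OpenAI Python client and placeholder image URLs; the actual challenge baseline, prompts, and evaluation setup are not reproduced here.

```python
# Illustrative two-stage describe-then-narrate pipeline (not the official baseline).
# Assumes the OpenAI Python client and publicly reachable placeholder image URLs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_image(image_url: str) -> str:
    """Stage 1: ask the model for a literal description of one image."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in two sentences."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content

def write_story(descriptions: list[str]) -> str:
    """Stage 2: turn the ordered descriptions into one coherent story."""
    prompt = (
        "Write a short, coherent story grounded in these scene descriptions, "
        "keeping characters consistent across scenes:\n"
        + "\n".join(f"{i + 1}. {d}" for i, d in enumerate(descriptions))
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

image_urls = ["https://example.com/img1.jpg", "https://example.com/img2.jpg"]  # placeholders
story = write_story([describe_image(url) for url in image_urls])
print(story)
```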
Do large language models and humans have similar behaviours in causal inference with script knowledge?
Xudong Hong | Margarita Ryzhova | Daniel Biondi | Vera Demberg
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)
Recently, large pre-trained language models (LLMs) have demonstrated superior language understanding abilities, including zero-shot causal reasoning. However, it is unclear to what extent their capabilities are similar to human ones. Here we study the processing of an event B in a script-based story, which causally depends on a previous event A. In our manipulation, event A is stated, negated, or omitted in an earlier section of the text. We first conducted a self-paced reading experiment, which showed that humans exhibit significantly longer reading times when causal conflicts exist (¬A → B) than under logical conditions (A → B). However, reading times remain similar when cause A is not explicitly mentioned, indicating that humans can easily infer event B from their script knowledge. We then tested a variety of LLMs on the same data to check to what extent the models replicate human behavior. Our experiments show that 1) only recent LLMs, like GPT-3 or Vicuna, correlate with human behavior in the ¬A → B condition, and 2) despite this correlation, all models still fail to predict that nil → B is less surprising than ¬A → B, indicating that LLMs still have difficulties integrating script knowledge.
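For illustration, the model-side measurement used to compare conditions (surprisal of the target event B under A → B, ¬A → B, and nil → B contexts) can be sketched as follows, here with GPT-2 as a stand-in and invented stimuli rather than the paper's materials.

```python
# Generic surprisal estimate for a continuation given a context, using GPT-2
# as a stand-in; the paper's stimuli and set of evaluated models differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def surprisal(context: str, target: str) -> float:
    """Summed surprisal (in bits) of the target tokens given the context."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    tgt_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, tgt_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probabilities of each target token given everything before it.
    log_probs = torch.log_softmax(logits[0, ctx_ids.size(1) - 1:-1], dim=-1)
    token_lp = log_probs[torch.arange(tgt_ids.size(1)), tgt_ids[0]]
    return float(-token_lp.sum() / torch.log(torch.tensor(2.0)))

# Invented example items for the three conditions: A -> B, not-A -> B, nil -> B.
for label, context in [
    ("A -> B",   "Tom bought a ticket. He entered the cinema."),
    ("notA -> B", "Tom did not buy a ticket. He entered the cinema."),
    ("nil -> B", "Tom arrived downtown. He entered the cinema."),
]:
    print(label, surprisal(context, " He handed his ticket to the attendant."))
```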
Retrieval-Augmented Modular Prompt Tuning for Low-Resource Data-to-Text Generation
Ruitao Feng | Xudong Hong | Mayank Jobanputra | Mattes Warning | Vera Demberg
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Data-to-text (D2T) generation describes the task of verbalizing data, often given as attribute-value pairs. While this task is relevant for many different data domains beyond the traditionally well-explored tasks of weather forecasting, restaurant recommendations, and sports reporting, a major challenge to the applicability of data-to-text generation methods is typically data sparsity. For many applications, there is extremely little training data in terms of attribute-value inputs and target language outputs available for training a model. Given the sparse data setting, recently developed prompting methods seem most suitable for addressing D2T tasks, since they do not require substantial amounts of training data, unlike fine-tuning approaches. However, prompt-based approaches are also challenging, as a) the design and search of prompts are non-trivial, and b) hallucination problems may occur because of the strong inductive bias of these models. In this paper, we propose a retrieval-augmented modular prompt tuning method, which constructs prompts that fit the input data closely, thereby bridging the domain gap between the large-scale language model and the structured input data. Experiments show that our method generates texts with few hallucinations and achieves state-of-the-art performance on a dataset for drone handover message generation.
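As a rough illustration of the retrieval step only, the following sketch scores training examples by attribute-value overlap and prepends the nearest ones to the prompt; the data, the overlap measure, and the prompt format are invented, and the paper's modular prompt-tuning components are not shown.

```python
# Toy retrieval step for prompt construction: pick the training examples whose
# attribute-value pairs overlap most with the input, then prepend them as
# in-context demonstrations. Purely illustrative data and scoring.

def jaccard(a: dict, b: dict) -> float:
    """Overlap between two attribute-value dicts, treated as sets of pairs."""
    pa, pb = set(a.items()), set(b.items())
    return len(pa & pb) / len(pa | pb) if pa | pb else 0.0

def build_prompt(input_data: dict, train_set: list, k: int = 2) -> str:
    neighbours = sorted(train_set, key=lambda ex: jaccard(input_data, ex["data"]), reverse=True)[:k]
    demos = "\n".join(f"Data: {ex['data']}\nText: {ex['text']}" for ex in neighbours)
    return f"{demos}\nData: {input_data}\nText:"

train_set = [
    {"data": {"event": "handover", "altitude": "120m"}, "text": "The drone hands over control at 120 metres."},
    {"data": {"event": "landing", "altitude": "0m"}, "text": "The drone has landed."},
]
print(build_prompt({"event": "handover", "altitude": "80m"}, train_set))
```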
2023
A surprisal oracle for active curriculum language modeling
Xudong Hong | Sharid Loáiciga | Asad Sayeed
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning
Visual Coherence Loss for Coherent and Visually Grounded Story Generation
Xudong Hong | Vera Demberg | Asad Sayeed | Qiankun Zheng | Bernt Schiele
Findings of the Association for Computational Linguistics: ACL 2023
Local coherence is essential for long-form text generation models. We identify two important aspects of local coherence within the visual storytelling task: (1) the model needs to represent re-occurrences of characters within the image sequence in order to mention them correctly in the story; (2) character representations should enable us to find instances of the same character and to distinguish different characters. In this paper, we propose a loss function, inspired by a linguistic theory of coherence, for self-supervised learning of image sequence representations. We further propose combining features from an object detector and a face detector to construct stronger character features. To evaluate input-output relevance, which current reference-based metrics do not measure, we propose a character matching metric that checks whether the models generate referring expressions correctly for characters in the input image sequences. Experiments on a visual story generation dataset show that our proposed features and loss function are effective for generating more coherent and visually grounded stories.
Visually Grounded Story Generation Challenge
Xudong Hong | Khushboo Mehra | Asad Sayeed | Vera Demberg
Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges
Recent large pre-trained models have achieved strong performance in multimodal language generation, which requires a joint effort of vision and language modeling. However, most previous generation tasks are based on single image input and produce short text descriptions that are not grounded on the input images. In this work, we propose a shared task on visually grounded story generation. The input is an image sequence, and the output is a story that is conditioned on the input images. This task is particularly challenging because: 1) the protagonists in the generated stories need to be grounded in the images and 2) the output story should be a coherent long-form text. We aim to advance the study of vision-based story generation by accepting submissions that propose new methods as well as new evaluation measures.
Visual Writing Prompts: Character-Grounded Story Generation with Curated Image Sequences
Xudong Hong | Asad Sayeed | Khushboo Mehra | Vera Demberg | Bernt Schiele
Transactions of the Association for Computational Linguistics, Volume 11
Current work on image-based story generation suffers from the fact that the existing image sequence collections do not have coherent plots behind them. We improve visual story generation by producing a new image-grounded dataset, Visual Writing Prompts (VWP). VWP contains almost 2K selected sequences of movie shots, each including 5-10 images. The image sequences are aligned with a total of 12K stories which were collected via crowdsourcing given the image sequences and a set of grounded characters from the corresponding image sequence. Our new image sequence collection and filtering process has allowed us to obtain stories that are more coherent, diverse, and visually grounded compared to previous work. We also propose a character-based story generation model driven by coherence as a strong baseline. Evaluations show that our generated stories are more coherent, visually grounded, and diverse than stories generated with the current state-of-the-art model. Our code, image features, annotations and collected stories are available at https://vwprompt.github.io/.
Visual Coherence Loss for Coherent and Visually Grounded Story Generation
Xudong Hong | Vera Demberg | Asad Sayeed | Qiankun Zheng | Bernt Schiele
Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)
2022
Two-Stage Movie Script Summarization: An Efficient Method For Low-Resource Long Document Summarization
Dongqi Liu | Xudong Hong | Pin-Jie Lin | Ernie Chang | Vera Demberg
Proceedings of The Workshop on Automatic Summarization for Creative Writing
The Creative Summarization Shared Task at COLING 2022 aspires to generate summaries given long-form texts from creative writing. This paper presents the system architecture and the results of our participation in the Scriptbase track, which focuses on generating movie plots from movie scripts. The core innovation of our model is a two-stage hierarchical architecture for movie script summarization. In the first stage, a heuristic extraction method is applied to extract actions and essential dialogues, which reduces the average length of input movie scripts by 66%, from about 24K to 8K tokens. In the second stage, a state-of-the-art encoder-decoder model, the Longformer Encoder-Decoder (LED), is trained with effective fine-tuning methods, BitFit and NoisyTune. Evaluations on the unseen test set indicate that our system outperforms both zero-shot LED baselines and other participants on various automatic metrics and ranks first in the Scriptbase track.
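For reference, the BitFit part of this recipe (training only the bias terms of a pre-trained LED model) can be set up roughly as below; the extraction heuristics, NoisyTune, and the training loop are omitted, and the checkpoint name is simply the public base model, not the system's actual configuration.

```python
# Rough BitFit setup for LED: freeze everything except bias parameters.
# The dialogue/action extraction heuristics, NoisyTune, and the training loop
# itself are omitted.
from transformers import LEDForConditionalGeneration

model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

for name, param in model.named_parameters():
    # BitFit: only bias terms stay trainable.
    param.requires_grad = "bias" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable / total:.4%} of {total}")
```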
2020
Diverse and Relevant Visual Storytelling with Scene Graph Embeddings
Xudong Hong | Rakshith Shetty | Asad Sayeed | Khushboo Mehra | Vera Demberg | Bernt Schiele
Proceedings of the 24th Conference on Computational Natural Language Learning
A problem in automatically generated stories for image sequences is that they use overly generic vocabulary and phrase structure and fail to match the distributional characteristics of human-generated text. We address this problem by introducing explicit representations for objects and their relations by extracting scene graphs from the images. Utilizing an embedding of this scene graph enables our model to more explicitly reason over objects and their relations during story generation, compared to the global features from an object classifier used in previous work. We apply metrics that account for the diversity of words and phrases of generated stories as well as for reference to narratively salient image features and show that our approach outperforms previous systems. Our experiments also indicate that our models obtain competitive results on reference-based metrics.
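A hedged illustration of the general idea of pooling scene-graph triples into an image-level feature is given below; the triples are hard-coded and the encoder is a plain embedding average, not the scene-graph extraction and embedding pipeline used in the paper.

```python
# Illustrative pooling of scene-graph triples into a single image-level vector.
# The triples would normally come from a scene-graph generator; here they are
# hard-coded, and the encoder is a simple embedding average.
import torch
import torch.nn as nn

class SceneGraphEmbedder(nn.Module):
    def __init__(self, vocab: list[str], dim: int = 128):
        super().__init__()
        self.index = {token: i for i, token in enumerate(vocab)}
        self.embed = nn.Embedding(len(vocab), dim)

    def forward(self, triples: list[tuple[str, str, str]]) -> torch.Tensor:
        ids = torch.tensor([[self.index[t] for t in triple] for triple in triples])
        # Mean over the three elements of each triple, then over all triples.
        return self.embed(ids).mean(dim=1).mean(dim=0)

vocab = ["man", "dog", "ball", "holds", "chases", "next_to"]
embedder = SceneGraphEmbedder(vocab)
image_vector = embedder([("man", "holds", "ball"), ("dog", "chases", "ball")])
print(image_vector.shape)  # torch.Size([128])
```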
2019
Improving Language Generation from Feature-Rich Tree-Structured Data with Relational Graph Convolutional Encoders
Xudong Hong | Ernie Chang | Vera Demberg
Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019)
The Multilingual Surface Realization Shared Task 2019 focuses on generating sentences from lemmatized sets of universal dependency parses with rich features. This paper describes the results of our participation in the deep track. The core innovation in our approach is to use a graph convolutional network to encode the dependency trees given as input. Upon adding morphological features, our system achieves the third rank without using data augmentation techniques or additional components (such as a re-ranker).
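A relational graph convolution over a toy dependency graph, the core encoder idea here, can be prototyped with PyTorch Geometric along the following lines; the node features, relation inventory, and downstream decoder are placeholders rather than the shared-task system.

```python
# Minimal relational GCN over a toy dependency graph using PyTorch Geometric.
# Node features, the relation inventory, and the decoder are all placeholders.
import torch
from torch_geometric.nn import RGCNConv

num_nodes, feat_dim, hidden_dim = 4, 16, 32
num_relations = 3  # e.g. nsubj, obj, det in a toy label set

x = torch.randn(num_nodes, feat_dim)              # lemma/feature embeddings
edge_index = torch.tensor([[1, 2, 3],             # dependent -> head edges
                           [0, 0, 2]])
edge_type = torch.tensor([0, 1, 2])               # one relation id per edge

conv = RGCNConv(feat_dim, hidden_dim, num_relations)
node_states = torch.relu(conv(x, edge_index, edge_type))
print(node_states.shape)  # torch.Size([4, 32])
```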
2018
Learning distributed event representations with a multi-task approach
Xudong Hong | Asad Sayeed | Vera Demberg
Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics
Human world knowledge contains information about prototypical events and their participants and locations. In this paper, we train the first models using multi-task learning that can both predict missing event participants and also perform semantic role classification based on semantic plausibility. Our best-performing model is an improvement over the previous state-of-the-art on thematic fit modelling tasks. The event embeddings learned by the model can additionally be used effectively in an event similarity task, also outperforming the state-of-the-art.
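A bare-bones sketch of a shared encoder with two task heads (missing-participant prediction and semantic role classification) is shown below; the dimensions, vocabularies, and inputs are placeholders, not the paper's architecture or training setup.

```python
# Bare-bones multi-task model: one shared event encoder feeding two heads,
# one for predicting a missing participant over a toy vocabulary and one for
# semantic role classification. All sizes are placeholders.
import torch
import torch.nn as nn

class MultiTaskEventModel(nn.Module):
    def __init__(self, vocab_size: int = 1000, num_roles: int = 6, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.participant_head = nn.Linear(dim, vocab_size)  # fill the missing slot
        self.role_head = nn.Linear(dim, num_roles)          # classify a filler's role

    def forward(self, event_word_ids: torch.Tensor):
        # Average the embeddings of the observed event words (predicate + fillers).
        shared = self.encoder(self.embed(event_word_ids).mean(dim=1))
        return self.participant_head(shared), self.role_head(shared)

model = MultiTaskEventModel()
participant_logits, role_logits = model(torch.randint(0, 1000, (2, 4)))
print(participant_logits.shape, role_logits.shape)  # (2, 1000) (2, 6)
```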
2016
Roleo: Visualising Thematic Fit Spaces on the Web
Asad Sayeed | Xudong Hong | Vera Demberg
Proceedings of ACL-2016 System Demonstrations