IgSEG: Image-guided Story Ending Generation

In this work, we propose a new task called Image-guided Story Ending Generation (IgSEG). Given a multi-sentence story plot and an ending-related image, IgSEG aims to generate a story ending that conforms to the contextual logic and the relevant visual concepts. In contrast to the story ending generation task, which generates open-ended endings, the major challenges of IgSEG are to comprehend the given context and image sufficiently, and to mine the appropriate semantics from the image to make the generated story ending informative, reasonable, and coherent. To address these challenges, we propose a Multi-layer Graph convolution and Cascade-LSTM (MGCL) based model, which mainly comprises two collaborative modules: i) a multi-layer graph convolutional network to learn the dependency relations of sentences and the logical clues of the context; ii) a multiple context-image attention module to generate story endings by gradually incorporating textual and visual semantic concepts. Our MGCL is thus capable of building logically consistent and semantically rich story endings. To evaluate the proposed model, we modify the existing VIST dataset to obtain the VIST-Ending dataset. Empirically, our MGCL outperforms all the strong baselines on both automatic and human evaluation.


Introduction
As two challenging subtasks of story generation, story ending generation (SEG) and visual storytelling (Huang et al., 2016; Zhao et al., 2018) have attracted increasing attention recently. The former generates text-based story endings (Zhao et al., 2018; Li et al., 2018; Guan et al., 2019), while the latter generates stories based on photo streams (Huang et al., 2016; Wang et al., 2018; Hu et al., 2020) or on a single image (Gaur, 2019).

Figure 1: In SEG, existing methods tend to generate generic, safe, and inane story endings, e.g., (c) "It was a happy day." IgSEG is designed to generate specific, reasonable, and informative endings induced by the given ending-related image; (d) "We were proud of our hits." is generated by the proposed MGCL model. [(a) Story context: "I went to a gun training class. I was not the only one, my friends came with me. We inspected ammo. And we learned how to shoot at targets." (b) The ending-related image. (e) Image captioning output: "Two people holding paper targets."]

Distinctly, both of these tasks take only single-modal inputs. In practice, however, people often confront demands to handle multi-modal inputs when generating a sentence or paragraph, e.g., comment generation given a news story and an image, or picture composition with a leading paragraph. To the best of our knowledge, an SEG task incorporating both a context and an image is still under-explored.
Furthermore, due to the limited textual information of the story context, the endings generated by SEG models still tend to be generic, safe, and inane. To make generated story endings more coherent, specific, and informative, we consider introducing visual information to enrich their generation. For example (cf. Figure 1), the story context (a) mainly narrates someone's experience of going to a gun training class. The story ending generated by SEG (c) just talks about the feeling of the day (e.g., happy), which seems generic, safe, and unattractive for lack of interesting events, imaginative conception, and evocative plots. Meanwhile, image captioning (e) generates a description of the given image (b) without any story plot. Here, we instead introduce the image to induce the development of the story plot and guide the generation of the ending. The image-guided story ending (d) is associated with high-level semantics (e.g., proud) and events (e.g., hits) drawn from the visual information. This ending is clearly of higher quality than the one generated by SEG.
We herein propose the Image-guided Story Ending Generation (IgSEG) task, which aims at generating a story ending from contextual plots and an ending-related image. Models need to comprehend the story plots and the image information, and grasp the visual semantic concepts strongly related to the story plots (e.g., event, behavior, and emotion). The main challenges of this task are three-fold: (i) how to accurately select and capture, from the image, the appropriate visual concepts matching the development trend of the story plot; (ii) how to fuse the language and visual information and model inter- and intra-modality relations efficiently; (iii) how to make the most of the high-level semantics mined from the image to write coherent, semantically informative, and imaginative story endings.
To capture the textual contextual plots and merge visual features effectively, we propose a Multi-layer Graph convolution and Cascade-LSTM (MGCL) model. A multi-layer graph convolution module is constructed to capture and encode the clue information (e.g., dependency relations) hidden in the context. In detail, following Huang et al. (2021), for each sentence we construct a graph over its dependency parsing tree and conduct convolutional operations with Graph Convolutional Networks (GCN). We then employ an attention mechanism to compress each graph into one node and deliver that node from lower layers to higher layers to aggregate inter-sentence information. Furthermore, inspired by Anderson et al. (2018), we design a Multiple Context-Image Attention (MCIA) module to merge the contextual features and the image features. Specifically, we apply attention mechanisms to weight sentence features and image features separately, then concatenate them and feed them into the next Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) cell. We perform experiments on the VIST-Ending (VIST-E) dataset, which is modified from VIST (Huang et al., 2016).
Our contributions can be summarized as follows:
• We define a new task termed IgSEG to generate coherent, specific, and informative story endings guided by an ending-related image. To the best of our knowledge, this is the first story generation task with multi-modal inputs.
• We propose a model called MGCL, which employs multi-layer graph convolutional operations to capture story contextual plots and multiple context-image attentions to merge visual features effectively.
• Experiments show that our model outperforms several strong baselines on the VIST-E dataset. Human evaluations show that our model can generate story endings with better grammaticality, logicality, and relevance.

Related Work
The IgSEG task is related to (i) Story Ending Generation (SEG) and (ii) Visual Storytelling (VIST). SEG (Zhao et al., 2018) is a subtask of story generation, which aims to understand the context and generate a coherent story ending. Many researchers have made great efforts on SEG. To improve the diversity and rationality of generated story endings, Li et al. (2018) employed a Seq2Seq model based on adversarial training. Similarly, Guan et al. (2019) made the model generate reasonable endings by introducing external commonsense knowledge. Further, Wang and Wan (2019) adopted a transformer-based conditional autoencoder to capture contextual clues and improve the coherence of story endings. Guan et al. (2020) proposed a knowledge-enhanced pretraining approach for generating more reasonable stories. Huang et al. (2021) proposed a multi-level GCN to capture the dependency relations of input sentences. Although previous studies have made great progress, due to the limitations of the SEG task itself, the generated endings still tend to be generic and safe.

IgSEG is relevant to VIST as well. VIST aims to generate a coherent story according to an image stream; its main difficulty is generating image-relevant sentences. Previous works on VIST can be roughly divided into three categories. The first category focuses on designing specific model architectures to improve the quality of the generated stories (Kim et al., 2018). The second generates more expressive output with reinforcement learning and adversarial training (Wang et al., 2018; Mo et al., 2019; Hu et al., 2020). The third generates stories with more common sense by incorporating external knowledge (Wang et al., 2020; Jung et al., 2020). However, the inputs of both SEG and VIST are single-modal. Generating story endings given a textual sequence and an image simultaneously remains unexplored; we therefore propose the IgSEG task.

Figure 2: Overview of the multi-layer GCN encoder. For intra-sentence information, each sentence is constructed as a graph over its dependency parse. For inter-sentence information, each graph is compressed into one node and delivered to the next layer. The output representations of the graphs (i.e., S_1, S_2, S_3, S_4) are fed into the decoder. [Example context: "I went to a gun training class. I was not the only one, my friends came with me. We inspected ammo. And we learned how to shoot at targets."]

Overview
The proposed IgSEG task aims to generate a story ending conforming to the given contextual and visual information. Given a story context X = {X_1, X_2, ..., X_μ} and an ending-related image V, where X_μ = x_1^μ x_2^μ ... x_c^μ represents the μ-th sentence with c words, IgSEG aims at generating a story ending E = y_1 y_2 ... y_m with m words.
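Concretely, one instance pairs a multi-sentence context with an ending-related image and a gold ending. A minimal sketch of this input/output structure (the field names are our own illustration, not from the released dataset):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class IgSEGExample:
    """One IgSEG instance: a multi-sentence context X, an ending-related
    image V (here just a path), and the gold story ending E."""
    context: List[str]   # X_1 .. X_mu, each a sentence
    image_path: str      # the ending-related image V
    ending: str          # gold ending E = y_1 .. y_m

example = IgSEGExample(
    context=[
        "I went to a gun training class.",
        "I was not the only one, my friends came with me.",
        "We inspected ammo.",
        "And we learned how to shoot at targets.",
    ],
    image_path="images/targets.jpg",
    ending="We were proud of our hits.",
)

mu = len(example.context)        # number of context sentences (mu = 4)
m = len(example.ending.split())  # ending length m in words
```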
To generate context-consistent and image-related story endings, we propose a Multi-layer Graph convolution and Cascade-LSTM (MGCL) model based on the encoder-decoder framework. In the encoder, we propose a Multi-layer Graph Convolutional Network (MGCN) over dependency trees to learn the context representation (cf. Figure 2), and we extract the image features with ResNet-152 (He et al., 2016). When decoding, we generate the story ending with a cascaded LSTM framework. Specifically, we employ a Top-Down LSTM to fuse the context features and image features, and we devise a Multiple Context-Image Attention (MCIA) module to grasp the image-related context information and the context-relevant image information for text generation. We introduce each part of MGCL below.

Story Context Representation
Graph Construction We parse the sentences with the Stanford Dependency tool (De Marneffe et al., 2014) (cf. Figure 3). To capture the relations between the words in a sentence, we construct a graph G over the dependency parsing tree of each sentence. Regarding the words x as nodes O_k, the word representations n_i as node features, and the corresponding relations on the dependency parsing tree as edges ξ_k, the graph of the k-th sentence (k = 1, 2, 3, 4) is constructed as

G_k = (O_k, ξ_k).

Multi-Layer GCN To deliver inter-sentence information, we utilize an attention mechanism to weight each node and sum the weighted nodes into a new node n_a^(k) for the (k+1)-th layer GCN (cf. Figure 2):

a_i^k = softmax_i(W_0 n_i^k + b_0),    n_a^(k) = Σ_i a_i^k n_i^k,

where n_i^k denotes the features of the i-th word of the k-th sentence, and W_0 and b_0 are trainable parameters.

Figure 4: The MCIA module consists of μ LSTMs. v̄ denotes the mean-pooled image features and s̄ the mean-pooled context features; the symbol "···" denotes the omitted parts of the MCIA module.
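As a toy illustration of the graph construction described above, the sketch below builds G_k = (O_k, ξ_k) from a hand-supplied dependency parse; the parse and head indices here are illustrative, whereas the paper obtains them with the Stanford Dependency tool:

```python
def build_dependency_graph(words, heads):
    """Construct graph G_k over a dependency parse: nodes O_k are the
    word positions, edges xi_k connect each word to its head. `heads`
    gives the head index of each word (-1 marks the root)."""
    nodes = list(range(len(words)))
    edges = set()
    for i, h in enumerate(heads):
        if h >= 0:                              # skip the root
            edges.add((min(i, h), max(i, h)))   # undirected arc
    return nodes, edges

# "We inspected ammo." -- "inspected" is the root; "We" and "ammo"
# both depend on it (toy parse, supplied by hand).
words = ["We", "inspected", "ammo"]
heads = [1, -1, 1]
nodes, edges = build_dependency_graph(words, heads)
# nodes -> [0, 1, 2]; edges -> {(0, 1), (1, 2)}
```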
After updating the nodes of the (k+1)-th layer GCN, the structure of graph G_{k+1} is represented by an adjacency matrix A^{k+1}. The corresponding value A_{ij} is 1 if a relation exists between node i and node j, and 0 otherwise. The representations of node i and its neighbor node j ∈ φ(i) are n_i^{k+1} and n_j^{k+1}, respectively. To obtain the correlation score w_{ij}^{k+1} between node i and node j, we learn a connected layer over the concatenation of the node features:

w_{ij}^{k+1} = σ(W_1^T [n_i^{k+1}; n_j^{k+1}] + b_1),

where W_1 and b_1 are trainable parameters, σ is the non-linear activation function, (·)^T denotes the transpose operation, and [;] denotes the concatenation operation.

We apply the softmax function over the correlation scores w_{ij} to obtain the weights α_{ij}:

α_{ij}^{k+1} = exp(w_{ij}^{k+1}) / Σ_{j'∈φ(i)} exp(w_{ij'}^{k+1}).

In the adjacency matrix A^{k+1}, the value is α_{ij}^{k+1} if the relation exists between node i and node j, and 0 otherwise. For each node of the (k+1)-th GCN layer, we update the (h+1)-th representation n_i^{h+1} by aggregating the representations of its h-th neighboring nodes n_j^h:

n_i^{h+1} = σ(Σ_{j∈φ(i)} α_{ij}^{k+1} W_2^h n_j^h + b_2^h),

where W_2^h and b_2^h are trainable parameters. After l updates, the output S_{k+1} of the GCN is obtained by attention-pooling the final node representations, analogous to the compression above:

S_{k+1} = Σ_i a_i^{k+1} n_i^l.
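The update above can be sketched in miniature. The snippet below runs one GCN layer on a three-node chain graph: each edge is scored from the (summed) concatenated node features, the scores are softmax-normalized into the α_ij weights, and each node aggregates its neighbors. The scalar weights w1 and w2 stand in for the trainable parameters W_1 and W_2 and are not the authors' values:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def gcn_layer(feats, neighbors, w1, w2, b1=0.0, b2=0.0):
    """One attention-weighted GCN update: score each neighbor pair,
    normalize into alpha_ij, then aggregate alpha_ij * w2 * n_j."""
    new_feats = []
    for i, ni in enumerate(feats):
        # correlation score over concatenated features [n_i; n_j],
        # collapsed to scalars (sums) for this toy example
        scores = [math.tanh(w1 * (sum(ni) + sum(feats[j])) + b1)
                  for j in neighbors[i]]
        alphas = softmax(scores)                 # alpha_ij over phi(i)
        agg = [0.0] * len(ni)
        for a, j in zip(alphas, neighbors[i]):
            for d in range(len(ni)):
                agg[d] += a * w2 * feats[j][d]   # alpha_ij * W_2 n_j
        new_feats.append([math.tanh(x + b2) for x in agg])
    return new_feats

feats = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}          # chain graph 0-1-2
out = gcn_layer(feats, neighbors, w1=0.1, w2=1.0)
```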

Decoder
The inputs of the decoder are the context features S = {S_k}_{k=1}^{4} and the image features v extracted with the pre-trained ResNet-152 (He et al., 2016), as shown in Figure 4.

Top-Down LSTM (TD) Following previous work (Anderson et al., 2018), we employ a Top-Down LSTM to incorporate the visual information (cf. Figure 4(a)). We write a single LSTM time step in the decoder with the following notation:

h_t = LSTM(x_t, h_{t-1}),

where x_t is the input vector of the LSTM and h_t is the output vector. The input x_t^D of the TD module consists of the previous output h_{t-1}^L of the MCIA module, the mean-pooled image features v̄, and the embedding E(w_{t-1}) of the previously generated word, where t denotes the current time step.
To incorporate the context information, we modify the original inputs of the TD module. First, we calculate the mean-pooled context features s̄:

s̄ = (1 / (μc)) Σ_{k=1}^{μ} Σ_{i=1}^{c} n_i^k,

where μ denotes the number of sentences, c denotes the number of words in each sentence, and n_i^k denotes the representation of the i-th word of the k-th sentence. The input vector x_t^D is then

x_t^D = [h_{t-1}^L; v̄; s̄; E(w_{t-1})].

Multiple Context-Image Attention (MCIA) To merge the context features and image features, we devise the MCIA module. The MCIA module consists of four LSTM layers (cf. Figure 4(b)), which share all parameters. The output h_t^D of TD is input to the MCIA module. Given the output h_t^k of the (k-1)-th LSTM layer in the MCIA module (h_t^1 = h_t^D), at each time step we calculate the normalized attention weight a_{i,t}^k for each word representation n_i^k of the k-th sentence:

a_{i,t}^k = softmax_i((w_a^k)^T tanh(W_a^k n_i^k + W_{ha}^k h_t^k)),

where w_a^k, W_a^k, and W_{ha}^k are trainable parameters. The convex combination ŝ_t^k of all words is then calculated from the n_i^k:

ŝ_t^k = Σ_{i=1}^{c} a_{i,t}^k n_i^k.

Likewise, we calculate the normalized weight b_{i,t}^k for the features v_i of each region of the image:

b_{i,t}^k = softmax_i((w_b^k)^T tanh(W_b^k v_i + W_{hb}^k h_t^k)),

where w_b^k, W_b^k, and W_{hb}^k are trainable parameters. The convex combination v̂_t^k of the image is calculated from the image features v_i:

v̂_t^k = Σ_{i=1}^{M} b_{i,t}^k v_i,

where M denotes the number of regions of the image. We concatenate ŝ_t^k, v̂_t^k, and h_t^k as the input of the next LSTM layer:

x_t^{k+1} = [ŝ_t^k; v̂_t^k; h_t^k].

Given the output h_t^L of the MCIA module, we calculate the conditional distribution over possible output words at each time step as

p(y_t | y_{1:t-1}) = softmax(W_p h_t^L + b_p),

where W_p and b_p are trainable parameters, and y_{1:m} denotes the word sequence (y_1, ..., y_m). Finally, the product of conditional distributions is obtained by

p(y_{1:m}) = Π_{t=1}^{m} p(y_t | y_{1:t-1}).
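One MCIA step can be sketched as follows. For brevity the additive attention with trainable parameters is replaced by a plain dot-product score, and the feature dimensions and values are toy stand-ins; the real module attends with learned weights over LSTM states:

```python
import math

def attend(query, keys):
    """Score each key against the query, softmax-normalize (the
    a_{i,t} / b_{i,t} weights), and return the convex combination."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    mx = max(scores)
    es = [math.exp(s - mx) for s in scores]
    z = sum(es)
    weights = [e / z for e in es]
    mix = [sum(w * key[d] for w, key in zip(weights, keys))
           for d in range(len(keys[0]))]
    return weights, mix

def mcia_step(h_k, sent_words, img_regions):
    """One MCIA layer at one time step: attend over the k-th sentence's
    word features (-> s_hat) and over the image region features
    (-> v_hat), then concatenate [s_hat; v_hat; h_k] as the input of
    the next LSTM layer."""
    _, s_hat = attend(h_k, sent_words)
    _, v_hat = attend(h_k, img_regions)
    return s_hat + v_hat + h_k           # list concatenation

h_k = [1.0, 0.0]                                    # current LSTM output
sent_words = [[1.0, 0.0], [0.0, 1.0]]               # word features n_i^k
img_regions = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]  # region features v_i
x_next = mcia_step(h_k, sent_words, img_regions)
# x_next has dimension 2 + 2 + 2 = 6
```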

Dataset
To serve the IgSEG task, we modify the VIST dataset (Huang et al., 2016) to obtain the VIST-Ending (VIST-E) dataset; its statistics are shown in Table 1.

Baselines
We compare our model with the following models.
Seq2Seq is a stacked RNN-based model (Luong et al., 2015) with attention mechanisms. Transformer is a parallel model based solely on attention mechanisms (Vaswani et al., 2017). IE+MSA incorporates external knowledge with an incremental encoding model for story ending generation (Guan et al., 2019). T-CVAE is a transformer-based conditional variational autoencoder for missing story plot generation (Wang and Wan, 2019).

Automatic Evaluation BLEU (B-n) (Papineni et al., 2002) measures n-gram overlap with the references via direct word-ordering. CIDEr (C) (Vedantam et al., 2015) evaluates the similarity of a generated sentence against the references by human consensus. ROUGE-L (R-L) (Lin, 2004) is applied to find the length of the longest common subsequence.

Human Evaluation Considering the limitations of automatic evaluation and the complexity of the IgSEG task, it is necessary to conduct human evaluation. The criteria include three aspects: Grammaticality (Gram.) (Wang and Wan, 2019) evaluates how correct, natural, and fluent the generated story endings are. Logicality (Logic.) (Wang and Wan, 2019) evaluates whether the story endings are reasonable and coherent. Relevance (Rele.) measures how relevant the generated story endings and the input images are. We randomly pick 100 generated story endings from the test set for each model and employ three professional annotators to evaluate them. We apply a 5-grade marking system, with 5 as the best grade and 1 as the worst. The final results are the averages of the scores given by the three annotators.
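As an illustration of the ROUGE-L metric, a minimal LCS-based implementation follows (β = 1.2 is the value commonly used in captioning toolkits; the authors' exact evaluation scripts may differ):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence between token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1]
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-score from LCS-based recall and precision (Lin, 2004)."""
    c, r = candidate.split(), reference.split()
    l = lcs_len(c, r)
    if l == 0:
        return 0.0
    prec, rec = l / len(c), l / len(r)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)

ref = "we were proud of our hits"
hyp = "we were proud of the day"
score = rouge_l(hyp, ref)   # LCS = "we were proud of" (4 tokens)
```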

Experimental Settings
The dimension of the word embeddings is 300, initialized from GloVe.6B (Pennington et al., 2014). Each GCN is updated 5 times, and the maximum number of nodes in a GCN is 43. The hidden dimension of all LSTMs is 512. The number of LSTM layers in the MCIA module is 4. The dimension of the image features is 7 × 7 × 2048, extracted with ResNet-152 (He et al., 2016). During training on the VIST-E dataset, the number of epochs is set to 30 and the batch size to 128. The optimizer is Adam with an initial learning rate of 4e-4. All baselines keep their own default settings. The dropout rate is 0.5. In particular, the inputs of Seq2Seq, Transformer, IE+MSA, and T-CVAE are the concatenation of context representations and image features.
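For reference, the reported hyper-parameters can be collected as follows; the dict layout is our own sketch, not the authors' configuration format:

```python
# Hyper-parameters as reported for MGCL on VIST-E.
config = {
    "word_embed_dim": 300,             # GloVe.6B embeddings
    "gcn_updates": 5,                  # l, updates per GCN
    "gcn_max_nodes": 43,
    "lstm_hidden": 512,
    "mcia_layers": 4,
    "image_feat_shape": (7, 7, 2048),  # ResNet-152 features
    "epochs": 30,
    "batch_size": 128,
    "optimizer": "adam",
    "lr": 4e-4,
    "dropout": 0.5,
}

# Flattened image feature count: 7 x 7 = 49 regions of 2048 dims each,
# i.e., the M regions attended over by the MCIA module.
h, w, d = config["image_feat_shape"]
num_regions = h * w
```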

Automatic and Manual Evaluation
We perform experiments on the VIST-E dataset, comparing our model with several strong baselines, i.e., Seq2Seq, Transformer, IE+MSA, and T-CVAE.
The results of automatic and manual evaluation are shown in Table 2. We conduct significance tests comparing our model with these baselines by running all models ten times. The results show that our model significantly outperforms them, with all p-values < 0.01. Specifically, our model achieves an improvement of 8.66 / 5.39 / 3.42 / 8.23 / 3.14 / 1.66 over Seq2Seq / Transformer / IE+MSA / T-CVAE / MG+Trans / MG+CIA on B1. As for B4, our model achieves an improvement of 0.8 / 0.48 / 0.86 / 1.36 / 0.03 / 0.14 over the same baselines. Our MGCL model also outperforms the baselines on all manual evaluation criteria; compared with the best baseline, for example, Gram. increases from 3.46 to 3.51.

We further conduct text-only SEG experiments, with results shown in Table 4. Compared with the corresponding results in Table 2, Seq2Seq, Transformer, and our models perform worse overall, which indicates that the image is helpful for generating better endings. For IE+MSA and T-CVAE, however, performance drops when adding the image. One possible reason is that they are designed specifically for textual story generation, so it is hard for them to generate better story endings with an image. Further, we also conduct SEG experiments with two more recent models, the Plan&Write model (Yao et al., 2019) and the pretrained KE-GPT2 language model (Guan et al., 2020). The results show that KE-GPT2 achieves the best performance on the plain-text dataset, while our models are close to it.

To study the quantity of each part of speech, we count the numbers of nouns, verbs, and adjectives in the generated story endings (cf. Table 5). With the guidance of the image, the story endings achieve improvements of 18.87%, 22.41%, and 11.53% on nouns, verbs, and adjectives, respectively. The results indicate that the MGCL model can enrich story endings on the IgSEG task.

Visualization and Case Study
To explicitly demonstrate our model, we present the visualization and case study (cf. Figure 5).
The context (Figure 5(a)) is mainly about people going to the seaside for a holiday. The context is encoded by the MGCN module and fed, together with the image features, into the MCIA module. Key words are marked in red by our model; the darker the color, the more important the word. The entities, events, and emotions (e.g., We, go, trip, and exciting) are assigned higher attention weights by our model, which shows that our model can sufficiently understand the semantic information of the context. Similarly, we present the visualization for the image (Figure 5(b)). The regions in darker red are where our model focuses. Our model pays more attention to the regions with the important objects (e.g., bridge, city, and sky, which can be regarded as the view), showing that our model can also capture the vital visual concepts in the image. As shown in Figure 5(c), we present the cases of the baselines on SEG and IgSEG, respectively. On SEG, the Seq2Seq model generates a long ending with repeated words, but it is still reasonable. The Transformer and T-CVAE generate generic endings with positive sentiment. Our model generates a sentence describing how "we" ended the day, which is more interesting. On IgSEG, all the models generate more specific endings in which the visual concepts are mentioned. These cases show that our model can capture context-relevant image concepts and generate informative story plots.

[Figure 5: Visualization and case study. (a) Visualization of sentence attention weights over the story context: "X1: Everyone was excited to be going on vacation. X2: We stopped at a memorial and spent our time reading about the history of it. X3: It was the Fourth of July that day, we were proud to be American. X4: We decided to go on a diving trip, it seemed exciting." Ground truth ending: "We spent the rest of the day on the water, it was a great day!" (b) Visualization of image attention weights. (c) Endings generated by the baselines and our model on SEG and IgSEG, e.g., Seq2Seq on SEG: "At the end of the day, we went back to location location and view what a beautiful sky, it was a trip home from many times."; generic endings such as "At the end of the day, we had a great time." and "We went to the bridge."; and ours on IgSEG: "We ended the day by watching the bridge and enjoy the view."]

[Figure 6: Generation cases given the same ending-related image but different story contexts. (a) Context: "My company went on a team building adventure. We took part in many games. We learned how to work together. We built our own raft." GT: "In the end we all became friends." (b) Context: "The team went on a team building exercise. Some of the members of the group created rafts with tubes and long poles. Other members of the group carried individual tubes on sticks to the water. After the rafts were constructed many of the team members put them in the water to test them out." GT: "Even though they were wet at the end of the day they felt a sense of accomplishment."]

To vividly demonstrate the impact of the image on the IgSEG task, we show generation cases (cf. Figure 6) that are given the same ending-related image but different story contexts. The image shows five people who are very happy and jumping in front of the camera. The generated endings (a) and (b) are both coherent with their corresponding contexts. Context (a) has the logic chain team building → took part in games → work together → built raft, and context (b) has the logic chain team building → created rafts → carried tubes → tested rafts. According to the different logic chains, our model focuses on different regions of the image and generates story endings with different semantics. Context (a) merely links to the number of people in the image (e.g., friends), while context (b) is associated with people's postures and emotions (e.g., jumping and laughing suggest pride). To some extent, our MGCL model is able to capture latent high-level semantics (e.g., pride and celebration) hidden in the image.

Conclusion
We propose a new task termed Image-guided Story Ending Generation and transform the VIST dataset into VIST-Ending for IgSEG. We propose the MGCL model, which uses a multi-layer graph convolutional network to capture intra- and inter-sentence relations, and a multiple context-image attention module to merge the context features and image features. Results on automatic and manual evaluation show that our model outperforms all the baselines.