Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization

While much research has been done in text-to-image synthesis, little work has explored the use of the linguistic structure of the input text. Such information is even more important for story visualization, since its inputs have an explicit narrative structure that needs to be translated into an image sequence (or visual story). Prior work in this domain has shown that there is ample room for improvement in the generated image sequences in terms of visual quality, consistency, and relevance. In this paper, we first explore the use of constituency parse trees using a Transformer-based recurrent architecture for encoding structured input. Second, we augment the structured input with commonsense information and study the impact of this external knowledge on the generation of visual stories. Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images within a dual learning setup. We show that off-the-shelf dense-captioning models trained on Visual Genome can improve the spatial structure of images from a different target domain without needing fine-tuning. We train the model end-to-end using an intra-story contrastive loss (between words and image sub-regions) and show significant improvements on several metrics (and in human evaluation) for multiple datasets. Finally, we provide an analysis of the contributions of the linguistic and visuo-spatial information. Code and data: https://github.com/adymaharana/VLCStoryGan.


Introduction
Story Visualization is an emerging area of research with several potentially interesting applications, such as visualization of educational materials, assisting artists with web-comic creation, etc. Each story consists of a sequence of images along with a sequence of captions describing the content of the images. The goal of the task is to reproduce the images given the captions. It is more challenging than conventional text-to-image generation (Reed et al., 2016) because the generative model needs to identify the narrative structure expressed in the sequence of captions and translate it into a visual story. Some critical features of a good story include consistent character and background appearances, relevance to the individual captions as well as the overall story, and a coherent narrative. While recent text-to-image models (Ramesh et al., 2021; Cho et al., 2020; Li et al., 2019a) successfully generate high-quality images, they are not directly designed for narrative understanding over sequential text. Hence, story visualization necessitates independent research toward developing generative models for the task. In this paper, we explore the use of visuo-linguistic structured inputs and outputs for improving story visualization. Toward this end, we propose (V)isuo-spatial, (L)inguistic & (C)ommonsense StoryGAN, i.e., VLC-STORYGAN, which (1) uses constituency parse trees and commonsense knowledge as input via structure-aware encoders, (2) leverages a pretrained dense captioning model for additional positional and semantic information, and (3) is trained with an intra-story contrastive loss that maximizes global semantic alignment between input captions and generated visual stories.
Grammatical structures like constituency parse trees are potentially rich sources of information for visualizing relations between objects (or characters), their actions, and their attribute (property) modifiers. Nguyen et al. (2020), Xiao et al. (2017), and Cirik et al. (2018) demonstrate that inducing such tree structures within the encoder guides words to compose the meaning of longer phrases hierarchically and improves various tasks such as masked language modeling, translation, and visual grounding of language, suggesting potential gains in other tasks. Most text-to-image synthesis as well as story visualization models (Li et al., 2019c; Maharana et al., 2021) perform flat processing over free-text captions using LSTM or Transformer-based encoders. Hence, in order to leverage the grammatical information packed in constituency parse trees, we propose a novel Memory-Augmented Recurrent Tree-Transformer (MARTT) to encode captions and promote forward flow of hierarchical information across the sequence of captions for each story.

Further, we find that input captions in story visualization lack details about the visual elements in the image. Hence, we augment the captions with external knowledge. For instance, when one caption mentions snow while another mentions icy roads, we provide the knowledge that both are related to cold weather, encouraging the model to learn similar representations for either phrase.

Dual learning has served as an effective method for promoting desirable characteristics in the target output for both text-to-image generation (Qiao et al., 2019) and story visualization (Maharana et al., 2021). Song et al. (2020) use image segmentation to preserve character shapes, while Maharana et al. (2021) use video captioning for global alignment between the input captions and the generated sequence of frames. Each of these auxiliary tasks generates uni-modal outputs, dealing either with image or text. In a bid to combine the benefits of learning signals from both the visuo-spatial and language modalities, we propose the use of dense captioning as the dual task, which has proven useful as a source of complementary information for many vision-language tasks (Wu et al., 2019; Kim et al., 2020; Li et al., 2019b). Dense captioning models provide regional bounding boxes for objects in the input image and also describe each region. By using these outputs as dual learning feedback for story visualization, the generative model receives a signal rich in spatial as well as semantic information. The spatial signal is especially important for our task since the input captions do not contain any specifications about the shape, size, or position of characters within the story. Further, we find that off-the-shelf dense captioning models, which are trained on realistic images from Visual Genome, transfer well to a markedly different domain like cartoons and can be used to provide visuo-spatial feedback without fine-tuning on the target domain.
Finally, we want the model to recognize the subtle differences between frames in a story and generate relevant images that fit into a coherent narrative. Hence, we employ a contrastive loss between image regions and words in the captions at each timestep to improve semantic alignment between the caption and the image. Adjacent frames in a story often contain subtle differences, as can be seen in the example in Fig. 1. We modify the region-word contrastive loss proposed in Zhang et al. (2021) for story visualization by sampling negative images from adjacent frames, forcing the model to recognize the differences between frames.
Overall, our contributions are: (1) We propose VLC-STORYGAN to use linguistic information, augmented with commonsense knowledge, for conditional image synthesis. (2) We use dense captioning to provide complementary positional and semantic information during training and show that off-the-shelf models trained on Visual Genome can be effective without fine-tuning on the target domain. (3) We propose an intra-story contrastive loss between image regions and words to improve semantic alignment between captions and visual stories. (4) We achieve strong improvements on several metrics and in human evaluation for multiple datasets (PororoSV and FlintstonesSV), compared to the previous state-of-the-art, and show the usefulness of structured inputs and outputs, providing insights for future work.
Related Work

Story Visualization. The task of story visualization and the StoryGAN model were introduced by Li et al. (2019c). Zeng et al. (2019) and Li et al. (2020) used textual alignment modules and Weighted Activation Degrees, respectively, to improve the performance of StoryGAN. Song et al. (2020) add a figure-ground generator and discriminator to preserve the shape of characters. Maharana et al. (2021) demonstrate the effectiveness of video captioning as a dual task for story visualization and propose additional evaluation metrics. Notable recent models in the related field of text-to-image generation are large (Brock et al., 2018), trained on gigantic datasets (Ramesh et al., 2021), and based on Transformer architectures (Jiang et al., 2021). Mask-to-image generation modules have proven effective for smaller datasets containing detailed captions and additional information for aligning image sub-regions to words within captions (Pont-Tuset et al., 2020). This is in sharp contrast to the datasets available for story visualization, which have been repurposed from video QA datasets and hence contain short descriptions. Our work explores structured inputs and outputs for conditional image synthesis, a direction that has been largely unexplored in both text-to-image synthesis and story visualization.
Story Understanding & Commonsense. Iyyer et al. (2016) introduced Relationship Modelling Networks to extract evolving relationship trajectories between two characters in a novel. Chaturvedi et al. (2017) use latent variables to weigh predefined semantic aspects, like topical consistency, to improve encoding for story completion. Guan et al. (2019) augment story encodings with structured commonsense knowledge to improve story ending generation. We focus on the use of structured commonsense as well as grammatical trees to improve story encoding for the end goal of visualization.
Tree Encoder. Tree structures have traditionally been encoded using Tree-LSTMs (Tai et al., 2015; Miwa and Bansal, 2016; Yang et al., 2017b,a). In recent work, hierarchical priors have been enforced in the self-attention layers of the Transformer (Vaswani et al., 2017), and Harer et al. (2019) use a parent-sibling tree convolution block to perform structure-aware encoding. Nguyen et al. (2020) use sub-tree masking and hierarchical accumulation to improve machine translation. We propose a simpler Tree-Transformer architecture, augmented with memory units (Lei et al., 2020) for recurrence.
Contrastive Loss. Xu et al. (2018) first introduced a contrastive loss into text-to-image synthesis through the Deep Attentional Multimodal Similarity Model (DAMSM). ContraGAN (Kang and Park, 2020) minimizes a contrastive loss between multiple image embeddings in the same batch, in addition to class embeddings (Miyato and Koyama, 2018). Zhang et al. (2021) combine inter-modality and intra-modality contrastive losses and observe complementary improvements. We adapt the inter-modality loss for story visualization by sampling negatives from adjacent frames.
Dense Captioning. Dense captioning jointly localizes semantic regions and describes these regions with short phrases in natural language (Johnson et al., 2016). Wu et al. (2019) and Kim et al. (2020) use dense captions for visual and video question answering, respectively. We use a pretrained dense captioning model to first annotate our target dataset and then use it within a dual learning framework to improve image synthesis for story visualization.

Background
Given a sequence of sentences $S = [s_1, s_2, \ldots, s_T]$, story visualization is the task of generating a corresponding sequence of images $\hat{X} = [\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_T]$. The sentences form a coherent story with a recurring plot and characters. The generative model for this task has two main modules: a story encoder and an image generator. The story encoder $E(\cdot)$ consists of a recurrent encoder which takes word embeddings $\{w_{ik}\}$ for sentence $s_k$ at each timestep $k$ and generates contextualized embeddings $\{c_{ik}\}$. $E(\cdot)$ also learns a stochastic mapping from $S$ to a representation $h_0$ which encodes the whole story and is used to initialize the hidden states of the recurrent encoder (Li et al., 2019c; Maharana et al., 2021). The image generator $I(\cdot)$ takes $\{c_{ik}\}$ and pools them into representations $\{o_k\}$, which are then transformed into images $\{\hat{x}_k\}$. We train the model within a GAN framework (Goodfellow et al., 2014). The generated images are passed to image and story discriminators, which evaluate the images in different ways and send back a learning signal.
In VLC-STORYGAN, we use constituency trees as input to a structure-aware encoder. Further, we impose losses based on visuo-linguistic structures and contrastive loss on the model during training. We outline each of these modules in detail.

Memory-Augmented Recurrent Tree Transformer (MARTT)
Given a sentence $s$ of length $n$, let $G(s)$ be the constituency parse tree of $s$ produced by a parser. $T(s)$ denotes the ordered sequence of $n$ terminal nodes (or leaves) of the tree, and $N(s)$ denotes the set of non-terminal nodes (or simply nodes), each of which has a phrase label (e.g., NP, VP) and spans over a sequence of terminal nodes. Each leaf embedding is the concatenation of a word embedding and a corresponding node embedding. To compute node embeddings, we perform an upward cumulative average (Nguyen et al., 2020) over the non-terminal nodes (phrase labels) that the respective leaf token is a descendant of. For instance, as seen in Fig. 3, the node embedding for the word Pororo is the average of the embeddings for NNP and NP. The node representations are learnt during training and provide information about the phrase label classes for each token. The encoder receives the sequence of leaf embeddings, in the same order as in the sentence, as input. Within the encoder, the hierarchical structure of a parse tree is promoted by introducing sub-tree masking for encoder self-attention. For each word query, self-attention has access only to other members of its sub-tree at that layer. In Fig. 2, each token only attends to itself in the first layer of the Tree-Transformer. In the next layer, says and hi can attend to each other as they belong to the sub-tree rooted at VP. Consequently, all tokens within says hi and smiles can attend to each other in the third layer. This bottom-up approach, paired with node embeddings, induces the model to build a hierarchical understanding of the sentence through compositionality.
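To make the masking scheme concrete, below is a minimal sketch of how per-layer sub-tree attention masks can be derived from an nltk constituency tree. This is not from the released codebase; build_subtree_masks and the layer-to-ancestor correspondence are our illustrative assumptions.

```python
# Hypothetical sketch: per-layer sub-tree attention masks from an nltk Tree.
import torch
from nltk import Tree

def build_subtree_masks(tree: Tree, num_layers: int) -> torch.Tensor:
    """Return a (num_layers, n, n) boolean mask; True = attention allowed."""
    n = len(tree.leaves())
    # Position of each leaf's pre-terminal (POS tag) node in the tree.
    bases = [tree.leaf_treeposition(i)[:-1] for i in range(n)]
    masks = torch.zeros(num_layers, n, n, dtype=torch.bool)
    for l in range(num_layers):
        # At layer l, a token's group is the ancestor l levels above its
        # pre-terminal (clamped at the root); tokens attend within a group,
        # so masks grow bottom-up toward full attention.
        groups = [pos[: max(len(pos) - l, 0)] for pos in bases]
        for i in range(n):
            for j in range(n):
                masks[l, i, j] = groups[i] == groups[j]
    return masks

tree = Tree.fromstring("(S (NP (NNP Pororo)) (VP (VBZ says) (UH hi)))")
masks = build_subtree_masks(tree, num_layers=3)
print(masks[0].int())  # first layer: identity, each token attends only to itself
```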
The Tree Transformer was originally designed to encode a single tree input, whereas in our task we need to encode a sequence of trees for the sequence of images we plan to generate. Hence, we tie a series of Tree Transformers together by introducing memory cells and memory updater modules in each self-attention layer. At timestep $t$, the input query matrix within the self-attention layer attends over $[M^l_{t-1}; H^l_t]$, where $M \in \mathbb{R}^{T_m \times d}$ and $H \in \mathbb{R}^{T_c \times d}$ ($T_m$ denotes the memory state length and $T_c$ the caption length). The memory state $M^l_{t-1}$ is updated to $M^l_t$ following the steps outlined for the memory updater in Lei et al. (2020).
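As a concrete illustration, the sketch below implements one memory-augmented self-attention layer in PyTorch. The gated update is a simplification of the MART memory updater (Lei et al., 2020); the class name and the mean-pooling choice are illustrative assumptions, not the released implementation.

```python
# Simplified sketch of one memory-augmented self-attention layer.
import torch
import torch.nn as nn

class MemoryAugmentedAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.update_gate = nn.Linear(2 * d_model, d_model)
        self.candidate = nn.Linear(2 * d_model, d_model)

    def forward(self, h, memory):
        # h: (B, T_c, d) caption states H_t; memory: (B, T_m, d) state M_{t-1}.
        kv = torch.cat([memory, h], dim=1)        # queries attend over [M_{t-1}; H_t]
        out, _ = self.attn(h, kv, kv)
        # Gated update of the memory from pooled current states (simplified).
        pooled = out.mean(dim=1, keepdim=True).expand_as(memory)
        z = torch.sigmoid(self.update_gate(torch.cat([memory, pooled], dim=-1)))
        cand = torch.tanh(self.candidate(torch.cat([memory, pooled], dim=-1)))
        new_memory = (1 - z) * memory + z * cand  # M_t, carried to timestep t+1
        return out, new_memory
```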

Commonsense Knowledge
The input captions in most narrative datasets generally omit several relevant details about the plot or the background, which can be considered commonsense. For example, in a scene where two characters are outside on a sunny day, the caption does not explicitly mention the presence of the sky in the background or the brightness of the sun. Hence, in order to introduce this external knowledge and enrich the input captions, we extract commonsense concepts relevant to each frame. To do so, we follow Bauer et al. (2018) and use a simple entity-based method to extract relevant paths from ConceptNet (see Sec. 4 for details).
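A minimal sketch of this entity-based retrieval is shown below: keep ConceptNet triples whose subject or object mentions a noun or verb from the captions. It assumes triples have been exported to a local TSV file (conceptnet_triples.tsv is a hypothetical path) and uses spaCy for POS tagging; the exact matching heuristics in the paper may differ.

```python
# Hypothetical sketch: entity-based ConceptNet triple retrieval.
import csv
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_concepts(captions):
    """Collect noun/verb lemmas from all captions in a story."""
    words = set()
    for doc in nlp.pipe(captions):
        words |= {t.lemma_.lower() for t in doc if t.pos_ in ("NOUN", "PROPN", "VERB")}
    return words

def retrieve_triples(words, path="conceptnet_triples.tsv"):
    """Keep triples with at least one caption word in the subject or object."""
    triples = []
    with open(path) as f:
        for subj, rel, obj in csv.reader(f, delimiter="\t"):
            if words & set(subj.lower().split()) or words & set(obj.lower().split()):
                triples.append((subj, rel, obj))
    return triples

story = ["Pororo is driving a car on an icy road.", "Crong sits in the seat."]
print(retrieve_triples(extract_concepts(story)))
```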
The commonsense knowledge paths are merged into a sub-graph, which is then encoded using a Graph Transformer. We use the Transformer-based graph encoder from GraphWriter (Koncel-Kedziorski et al., 2019) for structure-preserving encoding of graphs. First, each input graph $g_k$ is converted into an unlabeled connected bipartite graph $G_k = (v_k, e_k)$, where $v_k$ is the list of entities and relations, and $e_k$ is the adjacency matrix describing the directed edges (Beck et al., 2018). Next, $v_k$ is projected to a dense, continuous embedding space $V_k$ and sent as input to the graph encoder. The encoder is composed of $L$ stacked Transformer blocks; each block consists of an $N$-headed self-attention layer followed by normalization and a two-layer feed-forward network. The resulting encodings are referred to as graph contextualized vertex encodings. The entity encodings $e_k$ are then appended to the output $c_k$ from MARTT and used in the alignment module (see Fig. 2).
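The sketch below illustrates the Levi-graph style conversion of Beck et al. (2018): each relation becomes its own vertex, yielding an unlabeled bipartite graph over entities and relations. Variable names are illustrative, not from the released code.

```python
# Sketch: convert labeled triples into an unlabeled bipartite (Levi) graph.
import torch

def to_bipartite(triples):
    vertices, edges, index = [], [], {}

    def vid(name):
        if name not in index:
            index[name] = len(vertices)
            vertices.append(name)
        return index[name]

    for subj, rel, obj in triples:
        s, o = vid(subj), vid(obj)
        r = len(vertices)
        vertices.append(rel)           # one fresh vertex per relation instance
        edges += [(s, r), (r, o)]      # directed: subj -> rel -> obj
    adj = torch.zeros(len(vertices), len(vertices))
    for i, j in edges:
        adj[i, j] = 1.0
    return vertices, adj

v, e = to_bipartite([("car", "HasA", "door"), ("car", "UsedFor", "driving")])
print(v)  # ['car', 'door', 'HasA', 'driving', 'UsedFor']: entities + relation nodes
```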

Image Generation
The image generator follows the two-stage approach of prior text-to-image generation works (Qiao et al., 2019; Xu et al., 2018; Zhang et al., 2017; Maharana et al., 2021). The alignment module performs attention-based semantic alignment (Xu et al., 2018) between image regions $h_k$ and words $m_k = [f_{entity}(e_k); f_{caption}(c_k)]$ at the current timestep. $f_{entity}$ and $f_{caption}$ are dense layers that project the commonsense and caption encodings, respectively, into the same space as the image embeddings. $\beta_{jik}$ indicates the weight assigned by the model to the $i$th word when generating the $j$th sub-region of the image. For the $j$th image sub-region, the word-context vector is calculated as the attention-weighted sum of the word representations, $a_{jk} = \sum_i \beta_{jik} m_{ik}$. The generated images are sent to the image and story discriminators, and the corresponding classification losses are used for training. We use the discriminator models proposed in Li et al. (2019c). Given the sentence $s_k$ and the context information vector $h_0$ from the story encoder, the image discriminator attempts to distinguish between the generated and ground truth image $x_k$, resulting in the loss $\mathcal{L}_{img}$. Similarly, the story discriminator classifies between the ground truth story and the generated sequence of images $\hat{X}$ to produce the loss $\mathcal{L}_{story}$. Additionally, the image discriminator is also used to classify the characters in the frame when labels are available.
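A minimal sketch of the attention-based alignment step, in the style of Xu et al. (2018): a soft attention $\beta$ over words for every image sub-region, followed by the word-context vectors. The shapes and the temperature gamma are illustrative assumptions.

```python
# Sketch: region-to-word attention and word-context vectors.
import torch
import torch.nn.functional as F

def word_context_vectors(h, m, gamma: float = 4.0):
    """h: (B, N, d) image sub-regions; m: (B, L, d) word/entity encodings."""
    sim = torch.bmm(h, m.transpose(1, 2))   # (B, N, L) region-word similarity
    beta = F.softmax(gamma * sim, dim=-1)   # beta[:, j, i]: weight of word i for region j
    a = torch.bmm(beta, m)                  # (B, N, d): a_j = sum_i beta_ji * m_i
    return a, beta
```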

Dual Learning with Dense Captioning
As we discussed in Sec. 1, dual learning can provide important visual or semantic signals for improving story visualization, depending on which auxiliary task is chosen for the feedback model. We propose the use of dense captioning for providing visuo-spatial as well as semantic learning signals during training, and use the model in Yang et al. (2017b) as the feedback model. The dense captioning model is not fine-tuned on images from the story visualization dataset, since the dataset lacks dense caption annotations and it is prohibitively time-consuming and expensive to gather such annotations for the task. Hence, we explore the use of Visual Genome-based predictions (Krishna et al., 2017) as "proxy" annotations for our dataset (see Fig. 2). Using these noisy predictions as ground truth, we train the generative model to optimize a bounding box loss (L1 regression; $\mathcal{L}_{bbox}$) as well as a captioning loss (cross-entropy; $\mathcal{L}_{caption}$).
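A sketch of the two dual-learning losses under our reading of the setup: L1 regression on predicted boxes against the Visual Genome "proxy" boxes, plus token-level cross-entropy on the region captions. Tensor shapes, the padding index, and the function name are assumptions; the frozen dense captioner that produces the proxies is not shown.

```python
# Sketch: dual-learning losses against Visual Genome "proxy" annotations.
import torch.nn.functional as F

def dense_caption_loss(pred_boxes, proxy_boxes, caption_logits, proxy_tokens,
                       lambda_bbox=1.0, lambda_caption=1.0):
    # pred_boxes, proxy_boxes: (B, R, 4) for R regions per image.
    # caption_logits: (B, R, L, V) over a vocab of size V; proxy_tokens: (B, R, L).
    l_bbox = F.l1_loss(pred_boxes, proxy_boxes)
    l_caption = F.cross_entropy(
        caption_logits.flatten(0, 2),   # (B*R*L, V)
        proxy_tokens.flatten(),         # (B*R*L,)
        ignore_index=0,                 # assumption: 0 is the padding token
    )
    return lambda_bbox * l_bbox + lambda_caption * l_caption
```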

Position Invariance via Bounding Box Loss
The input captions in our dataset do not specify positions for the characters. Unless there is explicit positional input, it is unreasonable to expect the model to reproduce the ground truth positions in generated images. Hence, in order to enforce positional invariance, we augment our dataset with mirrored versions of the frames and compute the bounding box loss against whichever of the original or mirrored targets yields the lower value (see Appendix A.1).

Contrastive Loss
As discussed in Sec. 3.4, the alignment and refinement module computes a pairwise cosine similarity matrix between all pairs of image regions and word tokens, followed by the soft attention $\beta_{ji}$ of image region $j$ to word $i$. The aligned word-context vector $a_j$ for the $j$th sub-region is the weighted sum of all word representations. Following Zhang et al. (2021), the score function between all sub-regions $h_k$ of image $x_k$ and all words $m_k$ corresponding to caption $s_k$ is defined as $S_{word}(x_k, s_k) = \log\left(\sum_{j=1}^{N} \exp(\cos(h_{jk}, a_{jk}))\right)$, where $N$ is the total number of sub-regions. Finally, the contrastive loss between the words and regions of image $x_k$ and its aligned sentence $s_k$, with the remaining frames of the same story serving as negatives, is defined as

$\mathcal{L}_{word} = -\sum_{k=1}^{T} \log \frac{\exp(S_{word}(x_k, s_k))}{\sum_{t=1}^{T} \exp(S_{word}(x_t, s_k))},$

where $T$ is the total number of frames in a story.
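A minimal sketch of this loss under the formulation above, assuming the $S_{word}$ scores have been precomputed for every frame-caption pair of a story; function names are illustrative.

```python
# Sketch: intra-story word-region contrastive loss (negatives = other frames).
import torch
import torch.nn.functional as F

def s_word(h, a):
    """h, a: (N, d) sub-regions and aligned word-context vectors for one frame."""
    return torch.log(torch.exp(F.cosine_similarity(h, a, dim=-1)).sum())

def intra_story_contrastive_loss(scores):
    """scores: (T, T) with scores[t, k] = S_word(x_t, s_k) for one story."""
    # Softmax over frames t for each caption k; the positive is the diagonal,
    # so the other frames of the story act as negatives.
    log_probs = F.log_softmax(scores, dim=0)
    return -log_probs.diagonal().mean()
```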
Conditioning Mechanism. The story encoder $E(\cdot)$ encodes the entire story $S$ into a single representation $h_0$, which functions as the initial memory state of the MARTT model, similar to Maharana et al. (2021). The input $S$ is the concatenation of sentence embeddings $s_k \in \mathbb{R}^{1 \times d_s}$ from all timesteps. The conditional augmentation technique (Zhang et al., 2017) is used to convert $S$ into a conditioning vector by constructing and sampling from a conditional Gaussian distribution, i.e., $h_0 = \mu(S) + \sigma^2(S)^{1/2} \odot \epsilon_S$, where $\epsilon_S \sim \mathcal{N}(0, 1)$ and $\odot$ represents element-wise multiplication. This introduces a KL-divergence loss between the learned distribution and the standard Gaussian distribution, i.e., $\mathcal{L}_{KL} = \mathrm{KL}\left(\mathcal{N}(\mu(S), \mathrm{diag}(\sigma^2(S))) \,\|\, \mathcal{N}(0, I)\right)$.
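A minimal sketch of this conditioning step, assuming a single linear layer predicts $\mu$ and $\log \sigma^2$; the class name is illustrative.

```python
# Sketch: conditional augmentation (Zhang et al., 2017) with the KL term.
import torch
import torch.nn as nn

class ConditionalAugmentation(nn.Module):
    def __init__(self, d_story: int, d_cond: int):
        super().__init__()
        self.fc = nn.Linear(d_story, 2 * d_cond)  # predicts mu and log-variance

    def forward(self, s):
        mu, logvar = self.fc(s).chunk(2, dim=-1)
        eps = torch.randn_like(mu)
        h0 = mu + torch.exp(0.5 * logvar) * eps   # h_0 = mu(S) + sigma(S) * eps
        # KL(N(mu, diag(sigma^2)) || N(0, I)), averaged over the batch.
        kl = 0.5 * (torch.exp(logvar) + mu ** 2 - 1 - logvar).sum(dim=-1).mean()
        return h0, kl
```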
Objective. The final objective function of the generative model is

$\min_{\theta_G} \max_{\theta_I, \theta_S} \left[\mathcal{L}_{KL} + \mathcal{L}_{img} + \mathcal{L}_{story} + \lambda_{bbox}\mathcal{L}_{bbox} + \lambda_{caption}\mathcal{L}_{caption} + \mathcal{L}_{word}\right],$

where $\theta_G$, $\theta_I$, and $\theta_S$ denote the parameters of the entire generator, the image discriminator, and the story discriminator, respectively. The $\lambda$ values are weight factors for the respective losses.

Experimental Settings
Evaluation. We adopt the metrics proposed in Li et al. (2019c) and Maharana et al. (2021):
• Character Classification: Frame accuracy (exact match) and character classification F1-score, using a fine-tuned Inception-v3 (Szegedy et al., 2016), to measure the visual quality of recurring characters in predicted images.
• Video Captioning Accuracy: BLEU scores for captions generated from predicted stories by a pretrained MART video captioning model (Lei et al., 2020).
• R-Precision: R-Precision for global semantic alignment between predicted images and ground truth captions, using the Hierarchical-DAMSM (Maharana et al., 2021).
• Fréchet Inception Distance (FID): The distance between the distributions of real and generated images, using a pretrained Inception-v3.
Since story visualization datasets are adapted from video captioning datasets, a single frame sometimes does not represent the caption perfectly. However, during training, we sample a different frame from the video each time, thus providing coverage of the entire video and of the association between the characters in the story and their depiction in the frames. With this process, the model is able to observe all characters from the caption in the target frames during training. During inference, our target is a static story, not a video. Hence, we evaluate predictions under the assumption that all characters should appear in the frame.
Dataset. We use the PororoSV dataset proposed in Li et al. (2019c) and the splits proposed in Maharana et al. (2021) to evaluate our approach. Each sample in PororoSV has 5 frames and 5 corresponding captions that form a narrative. There are 9 recurring characters throughout the dataset. Each character is featured in at least 10% of the frames, making it crucial for the model to be capable of generating each of them. There are 10191/2334/2208 samples in the training, validation, and test splits, respectively. The constituency parses are extracted with the self-attentive parser of Kitaev and Klein (2018) and pre-processed using spaCy and NLTK (Bird et al., 2009). For commonsense knowledge, we first extract nouns and verbs from all of the captions in a story and find ConceptNet triples (Speer et al., 2017) containing at least one of those words in the subject or object phrases. Next, we use pretrained GloVe embeddings (Pennington et al., 2014) to find a broader pool of related words and retrieve additional relevant triples. These triples are combined into knowledge graph inputs for each frame. We use the top ten bounding box and caption predictions from a dense captioning model pretrained on Visual Genome (Krishna et al., 2017) for dual learning.
Experiments. Our model is developed using PyTorch. All models are trained on the proposed training split and evaluated on the validation and test sets. We select the best checkpoints and tune hyperparameters using the character classification F1-score on the validation set.

Main Quantitative Results
The results on the PororoSV test set can be seen in Table 1. We compare our model VLC-STORYGAN to three baselines on PororoSV: StoryGAN (Li et al., 2019c), CP-CSV (Song et al., 2020), and DuCo-StoryGAN (Maharana et al., 2021). The final rows contain results with VLC-STORYGAN, which outperforms previous models across most metrics for PororoSV. We see drastic improvements in FID score and sizable improvements in character classification as well as frame accuracy scores. This demonstrates the superior visual quality of stories visualized via our proposed method. There is a small improvement in BLEU score and a slight drop in R-Precision. The captions in PororoSV correspond more accurately to a video segment than to a single image sampled from the segment (see example in Fig. 1). Hence, even though BLEU and R-Precision have been shown to correlate with human judgement in text-to-image synthesis (Hong et al., 2018), the PororoSV dataset fails to be an appropriate test bed for extending those metrics to story visualization. Since the samples are adapted from video datasets, there is poor correlation between a single frame and the caption that originally spanned an entire video clip. This leads to unstable results and smaller improvement margins for both metrics. Instead, the dataset presents a data-scarce scenario where the captions do not provide sufficient details for accurate generation of visual stories. This leaves ample scope for augmenting the input with external visual information such as scene graphs and dense captions, or structured knowledge such as commonsense graphs, as we have shown with our proposed model. The structured information in VLC-STORYGAN leads to better generation of multiple characters, as compared to DuCo-StoryGAN (Fig. 1).

Human Evaluation
We conduct human evaluation on the generated images from VLC-STORYGAN and DuCo-StoryGAN, using the three evaluation criteria listed in Li et al. (2019c): visual quality, consistency, and relevance (see Appendix for details). Predictions from our model for PororoSV are preferred 62% of the time for better visual quality (see Win% columns). Our model also produces more consistent and relevant images, but the higher percentage of ties between the two models for these attributes indicates that much work remains to be done to improve global alignment between captions and images.
We also examine 50 random samples from the PororoSV dataset and evaluate whether the bounding boxes predicted by the pretrained dense captioning model used in our approach are relevant to the task, i.e., whether more than 50% of the predicted bounding boxes for each sample capture a meaningful part of the frame. We observe a high accuracy for PororoSV, i.e., 68%.

Ablations

Table 3 contains minus-one ablations for VLC-STORYGAN on the PororoSV validation set. The first row shows results from the complete model VLC-STORYGAN. We then iteratively remove each of our contributions and observe the change in metrics. We obtain the largest drops in FID, character classification, and frame accuracy by replacing MARTT with the structure-agnostic MART (second row). This suggests that the constituency tree, as well as the MARTT architecture, aids in the comprehension of captions. We see similar but smaller drops with the exclusion of dense captioning from VLC-STORYGAN, since it provides important positional and semantic information about visual elements (third row). The minor margins for commonsense knowledge (fourth row) suggest that while it is a promising source of additional data, more work is needed for its proper integration with input captions. Finally, the results in the last row show that the intra-story contrastive loss is effective for global semantic alignment. (On the validation set of PororoSV, VLC-STORYGAN outperforms previous models across all metrics except FID, where StoryGAN is best. This may be attributed to the 58% frame overlap between the training and validation sets of PororoSV, since the trend does not transfer to the test set, where VLC-STORYGAN is best on FID and which has zero overlap with the training split. The FID metric has also been shown to be biased for finite sample estimates (Chong and Forsyth, 2020).) We also ran an experiment to isolate the effect of memory augmentation in our model, by training a non-recurrent (no memory) Transformer with tree representations for single image generation instead of story generation, and evaluated it using the story visualization metrics. We observed significant drops in these metrics, suggesting that memory-augmented recurrence is important for story-level generation.

Analysis and Discussion
In this section, we take a closer look at the various data sources for VLC-STORYGAN.

Linguistic & Commonsense Knowledge
Results from Table 3 show that the grammatical structure of the captions contributes to better understanding, which translates to improved visual stories. The improvement in frame accuracy further suggests that MARTT improves comprehension of multiple characters simultaneously present in the narrative. In order to further analyze this premise, we examine a story involving several characters and compare predictions from VLC-STORYGAN and DuCo-StoryGAN in Fig. 4. The constituency parse tree in Fig. 4 shows the hierarchical understanding of the caption that is inherent in the MARTT architecture. Sub-tree masking allows the model to attend over multiple characters independently in earlier layers and combine the encodings in later layers. This semantic understanding is reflected in the image generated by VLC-STORYGAN, which renders both characters mentioned in the caption distinctly, whereas DuCo-StoryGAN barely generates one of them, validating the idea that grammatical knowledge is beneficial for story visualization.
In Fig. 5, we demonstrate an example of commonsense knowledge for a single frame in a story. We extract a sub-graph containing general information about car from ConceptNet (Speer et al., 2017) and use the graph contextualized embeddings from the Graph Transformer for alignment with the generated image. The words door and seat rider correspond to specific sub-regions in the image and improve generalization.

[Example FlintstonesSV story captions, from a figure: Fred is in the living room. Fred is saying something in the room. Wilma and Fred are talking while in the room. Fred is standing in a room, talking to someone off camera right. Wilma is angrily talking to Fred in a room.]

Analysis of Dense Caption Feedback
We use the dense captioning predictions on ground truth images in the PororoSV dataset to obtain the dual learning loss signal for VLC-STORYGAN during training. While we expected the predictions to be noisy, we found many of them to be surprisingly relevant to the PororoSV dataset. For instance, most of the characters in PororoSV are identified as teddy bear, stuffed toy, or animal, and the dense captioning model provides roughly accurate bounding boxes for the entire character or prominent body parts (see Fig. 6). This explains the improvement in character classification scores with the addition of dual learning via dense captioning in our model. Many of the background elements in the stories, such as blue sky, wooden table, snow, and green tree, look similar to their realistic counterparts in our cartoon setting. The captions usually omit descriptions as well as positions of these minute details, whereas the dense captioning model provides precise locations and descriptions for them.

Generalization to Flintstones Dataset
In order to measure the generalization of our approach to another dataset, we transformed the Flintstones dataset presented in the text-to-video synthesis work CRAFT (Gupta et al., 2018) into a story visualization dataset. A single frame is sampled from each video clip, and frames from adjacent clips are gathered into stories of length 5 (similar to PororoSV). The resulting dataset, FlintstonesSV, has 7 major recurring characters and 20132/2071/2309 samples in the training, validation, and test splits. Our model VLC-STORYGAN outperforms DuCo-StoryGAN on all metrics. We see 3.89% and 2.84% improvements in character F1-score and frame accuracy with our structured framework. Additionally, FID drops by 5.15%, suggesting large improvements in visual quality (see Fig. 7). Under human evaluation, predictions from VLC-STORYGAN are preferred as much as or more than those from DuCo-StoryGAN 90% of the time. The percentage of ties for all attributes is high, leaving scope for future research on this dataset.

Conclusion
In this paper, we investigate the use of structured knowledge for the task of story visualization. We propose a novel recurrent Tree-Transformer for encoding constituency trees and augment it with commonsense knowledge. We train the model using dense captioning loss and intra-story contrastive loss. Our results demonstrate the effectiveness of these approaches. We believe that these methods will encourage the use of structured knowledge for story visualization and text-to-image synthesis.

Ethics/Broader Impacts
The datasets and corresponding train/validation/test splits used in this paper were proposed by Li et al. (2019c), Kim et al. (2017), Gupta et al. (2018), and Maharana et al. (2021). All the samples in the datasets consist of simple English sentences and cartoon images. Our experimental results are specific to the task of story visualization. The pretrained dense captioning model used in our paper is trained on English text and real-world images. All other models used and developed in our paper are trained on English text and cartoon images. By using cartoon images in our task, we avoid the egregious ethical issues associated with real-world usage of image generation, such as DeepFakes. We focus not on generating realistic images, but on improved multi-modal understanding in the context of story visualization.

A.1 Dense Captioning
For position invariance, we augment the PororoSV dataset with mirrored versions of the images and the corresponding mirrored versions of the bounding box predictions. When computing the bounding box loss in dual learning, we compute the loss with both the original bounding box prediction and its mirrored version as the target, and retain whichever is lower. This way, we avoid penalizing the model for inverted positions of the characters, since we do not provide explicit positional input to the model.
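A minimal sketch of this mirror-invariant loss: score the prediction against both the original and the horizontally flipped proxy boxes and keep the smaller loss. Boxes are assumed to be in (x1, y1, x2, y2) format on frames of a given width; function names are illustrative.

```python
# Sketch: bounding box loss that is invariant to horizontal mirroring.
import torch
import torch.nn.functional as F

def mirror_boxes(boxes, width):
    """Horizontally flip (x1, y1, x2, y2) boxes on a frame of the given width."""
    x1, y1, x2, y2 = boxes.unbind(-1)
    return torch.stack([width - x2, y1, width - x1, y2], dim=-1)

def mirror_invariant_bbox_loss(pred, proxy, width=64):
    loss_orig = F.l1_loss(pred, proxy)
    loss_flip = F.l1_loss(pred, mirror_boxes(proxy, width))
    return torch.minimum(loss_orig, loss_flip)  # keep the lower of the two
```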

A.2 Story & Image Discriminators
We use the story and image discriminators as outlined in StoryGAN (Li et al., 2019c). The image discriminator is given the generated image $\hat{x}_k$, the sentence $s_k$, and the context information vector $h_0$ from the story encoder, and distinguishes this triplet from the corresponding real triplet, which contains the real image $x_k$ instead of the generated one ($\mathcal{L}_{img}$). Additionally, the image discriminator also classifies the characters in the frame. The story discriminator evaluates the entire story $S$ against the generated sequence of images $\hat{X}$ ($\mathcal{L}_{story}$).

A.3 Image Generation
The image generator follows the two-stage approach of prior text-to-image generation works (Qiao et al., 2019; Xu et al., 2018; Zhang et al., 2017; Maharana et al., 2021). The first stage uses outputs from the encoder; the resulting image is fed through a second stage, which weighs the outputs from the structure-aware context encoder as well as the commonsense encoder according to the image sub-regions and reuses them for generation. The alignment module performs attention-based semantic alignment (Xu et al., 2018) between image regions $h_k$ and words $m_k = [f_{entity}(e_k); f_{caption}(c_k)]$ at the current timestep. $f_{entity}$ and $f_{caption}$ are dense layers that project the commonsense and caption encodings, respectively, into the same space as the image embeddings. $\beta_{jik}$ indicates the weight assigned by the model to the $i$th word when generating the $j$th sub-region of the image. For the $j$th image sub-region, the word-context vector is calculated as $a_{jk} = \sum_i \beta_{jik} m_{ik}$.

B.1 Evaluation Metrics

Li et al. (2019c) propose character classification accuracy (exact match) within frames of generated visual stories as a measure of visual quality. Maharana et al. (2021) propose an additional set of automated evaluation metrics that capture diverse aspects of a model's performance on visual story generation. We adopt these metrics for evaluating our models:
• Character Classification: We use the fine-tuned Inception-v3 (Szegedy et al., 2016) and report frame accuracy and character F1-score.
• Video Captioning Accuracy: We use the pretrained MART video captioning model (Lei et al., 2020) and report BLEU2/3 scores for the generated captions.
• R-Precision: We use the Hierarchical-DAMSM (Maharana et al., 2021) to report R-Precision scores on the pairs of ground truth captions and generated stories.
• Fréchet Inception Distance (FID): We report the FID score, which evaluates the distance between the distributions of real and generated images for text-to-image synthesis datasets.

B.2 Hyperparameters
The image size we use is 64×64, and the length of a story is 5 images/captions, the same as DuCo-StoryGAN. The learning rates of the generator and discriminators are 2e-4. The model is trained for 150 epochs, and the learning rate is decayed every 30 epochs. For each training update of the discriminators, two corresponding updates are performed for the generator network, with different mini-batch sizes for the image and story discriminators (Li et al., 2019c). The image discriminator batch size is 60 and the story discriminator batch size is 12. We found in our experiments that story visualization models are prone to mode collapse at lower batch sizes, which, contrary to conventional wisdom, is not resolved by adding a perceptual loss. The above-mentioned hyperparameters were optimized using 12 iterations of manual tuning. The MARTT hyperparameters are as follows: the hidden size of the model is 192; the number of memory cells is 3; the number of hidden layers is 4; the dropout value across the model is 0.1; the layer normalization epsilon is 1e-12; the number of attention heads is 6; the word embedding size is 300, initialized from the GloVe 840B checkpoint; and the node embedding size is 50.
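For reference, the hyperparameters above collected into a single config; the field names are illustrative, not those of the released code.

```python
# Hyperparameters from the text, gathered into one illustrative config dict.
MARTT_CONFIG = dict(
    image_size=64,
    story_length=5,
    lr=2e-4,
    epochs=150,
    lr_decay_every=30,
    image_disc_batch_size=60,
    story_disc_batch_size=12,
    hidden_size=192,
    num_memory_cells=3,
    num_layers=4,
    dropout=0.1,
    layer_norm_eps=1e-12,
    num_attention_heads=6,
    word_embedding_size=300,   # initialized from the GloVe 840B checkpoint
    node_embedding_size=50,
)
```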
The total number of trainable parameters in the VLC-STORYGAN is approximately 100M. We use the ADAM optimizer with betas of 0.5 and 0.999. We train the model on a single RTX A6000. Each epoch takes 50 minutes, with the model being saved every 10 epochs. At 150 epochs of training, the total training time is nearly 4 days.

C Results
See examples of predictions for PororoSV and FlintstonesSV from VLC-STORYGAN in Figures 8 and 9 respectively.

C.1 Human Evaluation
We conduct human evaluation on the generated images from VLC-STORYGAN and DuCo-StoryGAN, using the three evaluation criteria listed in Maharana et al. (2021): visual quality, consistency, and relevance. Two annotators are presented with a caption and the generated sequence of images from both models, and are asked to state their preferred sequence for each attribute. They also have the option to pick neither if both sequences fare the same. In terms of visual quality, predictions from our model are preferred 62% of the time, as compared to 28% for DuCo-StoryGAN (see Win% columns) for PororoSV. Our model is also preferred more often for the attributes consistency and relevance, but the higher percentage of ties between the two models for these attributes indicates that much work remains to be done to improve global alignment between captions and images.