Learning to Model Multimodal Semantic Alignment for Story Visualization

Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story, where the images should be realistic and keep global consistency across dynamic scenes and characters. Current works face the problem of semantic misalignment because of their fixed architecture and the diversity of input modalities. To address this problem, we explore the semantic alignment between text and image representations by learning to match their semantic levels in a GAN-based generative model. More specifically, we introduce dynamic interactions that learn to dynamically explore various semantic depths and fuse the different-modal information at a matched semantic level, which relieves the text-image semantic misalignment problem. Extensive experiments on different datasets demonstrate that our approach, which uses neither segmentation masks nor auxiliary captioning networks, improves image quality and story consistency compared with state-of-the-art methods.


Introduction
Story visualization is a challenging task, which aims to generate a sequence of story images given a multi-sentence story, and further requires output images to be consistent, e.g., having a consistent background or character appearance. Regardless of its difficulties, story visualization has the potential for many applications, including art creation, computer-aided design, and image editing.
To address the challenges, current methods (Li et al., 2019c; Song et al., 2020; Maharana et al., 2021; Maharana and Bansal, 2021) adopt a fixed StoryGAN-based (Li et al., 2019c) architecture, where two GANs (Goodfellow et al., 2014) are adopted, one for single-image quality and one for story consistency, without considering the semantic alignment between the different-modal text and image features involved in the generation process.
One problem that arises is that a fixed network operating on different-modal representations (e.g., text and image) may suffer from semantic misalignment. This is because current methods usually adopt a fixed text encoder and image encoder to extract the corresponding features, and then use these features directly in an equally fixed GAN-based network. However, text representations can be a coarse sentence vector, fine-grained word embeddings, or a structured knowledge graph (Mahon et al., 2020), while Gatys et al. (2016) and Johnson et al. (2016) have shown that image features extracted from different layers of a convolutional neural network (e.g., VGG) may contain different levels of semantic information. Hence, simply fusing cross-domain representations together in a fixed generative network may considerably harm the quality of output images. For example, one of the main functions of the discriminator in a conditional GAN is to evaluate the semantic alignment between the input text and the output image, and to provide the corresponding feedback to the generator, which encourages the generator to generate text-semantic-matched images. However, to evaluate this alignment, the discriminator in current methods simply concatenates a coarse sentence vector with small-scale (i.e., 4×4) image features extracted from a given image by a series of convolutional layers, which may fail to fully match the semantics between text and image features, and thus provide less precise training feedback to the generator.
To address this problem, we explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model. More specifically, we introduce dynamic interactions via learning to dynamically explore various semantic depths and to fuse the different-modal information at a matched semantic level. By doing this, the network can learn to dynamically utilize various semantic levels of information from the given representations, and also learn to selectively fuse them together, which mitigates the semantic misalignment problem across modalities. The main contributions are summarized as follows: • We fully explore the semantic misalignment problem existing in current methods, and propose a novel single-GAN-based network, which improves FID from 78.64 to 52.87 and FSD from 94.53 to 55.20 on Pororo-SV, and establishes a benchmark FID of 74.12 and FSD of 20.07 on Abstract Scenes.
• We conduct extensive experiments and a thorough analysis of aligning the semantics between different-modal inputs to provide general modeling insights into conditional GANs.

Related Work
Story visualization aims to generate a sequence of consistent images corresponding to a multi-sentence story. StoryGAN (Li et al., 2019c) introduced a two-GAN-based generation network. CP-CSV (Song et al., 2020), DUCO (Maharana et al., 2021), and VLC (Maharana and Bansal, 2021) were built on StoryGAN, where CP-CSV utilized character segmentation masks to improve the performance, and DUCO and VLC adopted auxiliary captioning networks to build a text-image-text circle to ensure the consistency between input and output. Recently, Li and Lukasiewicz (2022b) proposed to utilize fine-grained word information to build a concise single-GAN-based network, and Li et al. (2022a) proposed to combine clustering learning and contrastive learning to ensure better text and image representations in a joint space. However, all these methods were based on a fixed network without considering the semantic alignment between the involved text and image representations.

Overview
Differently from current methods that adopt two GANs, our network only has a single GAN, requiring neither additional segmentation masks nor auxiliary networks for supervision, as we experimentally find that a single-GAN-based network can effectively produce high-quality story images with good consistency. We attribute this improvement to the exploration of semantic alignment between different-modal representations, which enables fine-grained training feedback from the discriminator to the generator.
Given a story X with n story sentences, a text encoder encodes each story sentence S_i into a sentence vector s ∈ R^D with corresponding word embeddings s_word ∈ R^{D×L}, where D is the feature dimension, and L is the number of words in a sentence. Then, we feed these n sentence vectors into the generation pipeline, using upsampling blocks to produce story images at the required resolution. Meanwhile, we further incorporate the word embeddings into the generation pipeline via our proposed dynamic interactions, allowing the generator to learn to choose semantically aligned inputs from different semantic levels and modalities to achieve a better generation. In the discriminator, we also adopt the proposed dynamic interactions to fuse both text and image features, enabling a better evaluation of the text-image semantic alignment. The complete architecture is shown in Fig. 1.
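The encoding step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: a bi-directional LSTM whose per-step outputs serve as word embeddings s_word and whose final hidden states form the sentence vector s; the vocabulary size and dimension D = 128 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of a sentence encoder producing both a sentence vector
    s in R^D and word embeddings s_word in R^{D x L}."""
    def __init__(self, vocab_size, d=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        # Bi-directional LSTM; each direction outputs d//2 features.
        self.lstm = nn.LSTM(d, d // 2, bidirectional=True, batch_first=True)

    def forward(self, tokens):                  # tokens: (B, L) word ids
        out, (hn, _) = self.lstm(self.embed(tokens))
        s_word = out.transpose(1, 2)            # (B, D, L) word embeddings
        s = torch.cat([hn[0], hn[1]], dim=-1)   # (B, D) sentence vector
        return s, s_word

enc = TextEncoder(vocab_size=100)
s, s_word = enc(torch.randint(0, 100, (2, 7)))  # 2 sentences, 7 words each
```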

Exploration of Semantic Alignment
To ensure the semantic alignment between text and image representations, we introduce dynamic interactions via utilizing self-attention and cross-modal attention, learning to dynamically explore various semantic depths from these different-modal representations, and also to fuse them together at a matched semantic level, which thus mitigates the text-image semantic misalignment problem.

Attention Mode
Two attention modes are adopted in our approach: self-attention (Zhang et al., 2019) (SA) and word-level spatial attention (Xu et al., 2018) (WSA). To generalize both types of attention, we set a as one input and b as the other, where a denotes intermediate image features in the generator or discriminator, and b denotes word embeddings in WSA, or the same image features as a in SA. The attention weights are then obtained via β = Softmax(ab^T), and the weighted hidden features h via h = βb. Finally, we selectively fuse the weighted hidden features into the network using the proposed dynamic block (details are given in Section 3.1.2). Note that both types of attention share the same attention weights; SA mainly focuses on capturing correlations between long-range pixels within the same image (Li and Lukasiewicz, 2022a), while WSA mainly focuses on fusing cross-domain text information with intermediate image features in both the generator and the discriminator at a matched semantic level.
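The generalized attention above, β = Softmax(ab^T) followed by h = βb, can be sketched directly; projections and scaling used in the cited attention variants are omitted for clarity.

```python
import torch
import torch.nn.functional as F

def generalized_attention(a, b):
    """beta = Softmax(a b^T); h = beta b.
    a: (N, D) image features with spatial positions flattened;
    b: (M, D) word embeddings for WSA, or a itself for SA."""
    beta = F.softmax(a @ b.t(), dim=-1)  # attention weights, (N, M)
    return beta @ b                      # weighted hidden features, (N, D)

a = torch.randn(16, 32)                               # 16 positions, 32-dim
h_sa = generalized_attention(a, a)                    # self-attention
h_wsa = generalized_attention(a, torch.randn(8, 32))  # word-level attention
```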

Dynamic Block
Our proposed dynamic block is shown in Fig. 1 (right). Unlike current methods that only allow the interaction between different-modal representations at specific locations with fixed semantics, we propose the dynamic block to learn to achieve a dynamic interaction in an end-to-end manner. Our dynamic block is based on an attention selector, which selectively chooses an appropriate attention (i.e., SA or WSA) to further explore semantic depth or to fuse these cross-domain pieces of information together.
To achieve the selection effect, we first obtain global representations ā and b̄ of a and b (see Section 3.1.1) using average pooling. The correlation w between a and b can then be obtained via w = Sigmoid(ā b̄), where w denotes the correlation level between a and b. Then, Gumbel-Softmax reparameterization (Jang et al., 2016) is adopted to choose a particular attention, based on the probability of each attention; e.g., the probability of SA can be defined as p(SA) = exp((log w + z)/τ) / (exp((log w + z)/τ) + exp((log(1 − w) + z′)/τ)), where z = −log(−log(µ)) is sampled Gumbel noise, µ is drawn from the uniform distribution, and τ is a temperature hyperparameter. Similarly, p(WSA) is obtained from p(SA) by replacing w with 1 − w in the numerator. So, given a and b, our dynamic block performs soft weighting in training, h = p(SA) h_SA + p(WSA) h_WSA, and hard selection at inference, i.e., h = h_SA if p(SA) > p(WSA) and h = h_WSA otherwise, where h_SA denotes using self-attention to further explore semantic depth, and h_WSA denotes fusing finer word information into the generation pipeline.
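The selector above can be sketched as follows. This is a simplified, unbatched sketch under the definitions in this section; the exact logit parameterization and any learned projections in the actual model are assumptions.

```python
import torch

def dynamic_select(a, b, h_sa, h_wsa, tau=1.0, training=True):
    """Gumbel-Softmax choice between SA and WSA outputs:
    soft weighting in training, hard selection at inference."""
    a_bar, b_bar = a.mean(dim=0), b.mean(dim=0)    # global average pooling
    w = torch.sigmoid(a_bar @ b_bar)               # correlation level in (0, 1)
    mu = torch.rand(2).clamp(1e-6, 1 - 1e-6)       # mu ~ Uniform(0, 1)
    z = -torch.log(-torch.log(mu))                 # Gumbel noise per branch
    logits = torch.stack([torch.log(w), torch.log(1 - w)])
    p = torch.softmax((logits + z) / tau, dim=0)   # [p(SA), p(WSA)]
    if training:
        return p[0] * h_sa + p[1] * h_wsa          # soft weighting
    return h_sa if p[0] > p[1] else h_wsa          # hard selection

a, b = torch.randn(16, 32), torch.randn(8, 32)
h_sa, h_wsa = torch.randn(16, 32), torch.randn(16, 32)
h_train = dynamic_select(a, b, h_sa, h_wsa, training=True)
h_infer = dynamic_select(a, b, h_sa, h_wsa, training=False)
```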

Objective Functions
The training follows the standard GAN procedure, where the generator and discriminator are trained alternately by minimizing their respective losses.
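The alternating update can be sketched as below. Since the section does not spell out the adversarial objective, a conditional hinge loss is assumed here purely for illustration; `G` and `D` stand for any text-conditioned generator and discriminator.

```python
import torch

def gan_step(G, D, opt_g, opt_d, text, real_images):
    """One alternating training step (hinge loss is an assumption)."""
    # Discriminator step: push real scores up, generated scores down.
    fake = G(text).detach()
    d_loss = (torch.relu(1.0 - D(real_images, text)).mean()
              + torch.relu(1.0 + D(fake, text)).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step: fool the text-conditioned discriminator.
    g_loss = -D(G(text), text).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```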

Implementation
We evaluated our approach at a resolution of 64×64.
The text encoder is a bi-directional LSTM, pretrained to maximize the cosine similarity between matched image and text features (Xu et al., 2018). We selected the best checkpoints and tuned hyperparameters using the FID and FSD scores.
The network was trained for 120 epochs on both Pororo-SV and Abstract Scenes. The Adam optimizer (Kingma and Ba, 2014) was adopted with a learning rate of 0.0002. We evaluated our approach on a single Quadro RTX 6000 GPU.

Datasets
Pororo-SV is adopted to evaluate our approach; it is built on PororoQA, a dataset for video question answering (Kim et al., 2017). In Pororo-SV, each story has five consecutive images with corresponding text descriptions. There are 13,000 story samples in the training set and 2,336 story samples in the test set. In contrast, we do not evaluate our approach on CLEVR-SV (Li et al., 2019c), as there are only 15 different words in the entire CLEVR-SV dataset, which might fail to fully exercise a multimodal network for the story visualization task. We further adopt Abstract Scenes (Zitnick and Parikh, 2013) to evaluate our approach. Abstract Scenes was proposed for studying semantic information, and contains over 1,000 sets of 10 semantically similar scenes of children playing outside. The scenes are composed of 58 clip-art objects, and six sentences describe different aspects of each scene. In this dataset, we treat scenes from the same set as a story, as they are all created from the same seed scene and thus share similar semantic information.

Evaluation Metrics
The Fréchet inception distance (FID) (Heusel et al., 2017) and the Fréchet story distance (FSD) (Song et al., 2020) are adopted as quantitative evaluation metrics. FID computes the Fréchet distance between the distribution of real images and the distribution of fake images. Differently from FID, which focuses on single images, FSD is designed for the story visualization task and takes the sequence of images into account. FSD is built on the principle of FID, using R(2+1)D (Tran et al., 2018) as the backbone model, where R(2+1)D supports a flexible sequence length and has a strong ability to capture temporal consistency. However, as neither FID nor FSD can reflect the semantic alignment between sentences and story images, following (Li et al., 2022a; Li and Lukasiewicz, 2022b), we compute the average cosine similarity (Cosine) between pairs of sentences and synthetic images over the test set, and further scale the value by 100.
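The Cosine metric above reduces to a few lines once sentence and image features are available; how those features are extracted is assumed to follow the cited works and is outside this sketch.

```python
import torch
import torch.nn.functional as F

def cosine_metric(text_feats, image_feats):
    """Average cosine similarity between matched sentence/image feature
    pairs over the test set, scaled by 100.
    text_feats, image_feats: (N, D) tensors, row i is a matched pair."""
    sims = F.cosine_similarity(text_feats, image_feats, dim=-1)  # (N,)
    return 100.0 * sims.mean().item()

x = torch.randn(4, 8)
perfect = cosine_metric(x, x)  # identical features give a score of 100
```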
Besides, we show the number of parameters in the generator and the discriminator for different methods on Pororo-SV to compare the size of different networks.

Qualitative Evaluation
Figs. 2 and 3 show examples of a visual comparison between our approach and the baselines on Pororo-SV and Abstract Scenes, respectively. Our approach generates realistic images with better regional details, text-image alignment, and consistency. For example, in Fig. 2, the characters Pororo (i.e., the penguin) and Crong (i.e., the frog) have a sharper shape with fine-grained regional details, such as hats and glasses; and in Fig. 3, our approach generates a ball, aligned with the given first sentence, while the other methods fail to generate the required ball object on the grass.

Quantitative Evaluation
Table 1 shows a quantitative comparison using the following widely used metrics on Pororo-SV and Abstract Scenes: FID (Heusel et al., 2017), FSD (Song et al., 2020), and Cosine (Li et al., 2022a). From the table, we can observe that our approach achieves better results than the others. This illustrates that our approach can generate images of finer quality, achieve better image-text semantic alignment, and keep higher consistency across story images.
We further compare the number of trainable parameters in different methods. As our method is a single-GAN-based network, compared to StoryGAN, it reduces the size of the generator by about 46.59% and that of the discriminator by about 50.21%.
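The trainable-parameter counts compared above are typically obtained as follows; this is the standard PyTorch idiom, not code from the paper.

```python
import torch.nn as nn

def count_trainable(module: nn.Module) -> int:
    """Number of trainable parameters in a module (generator or
    discriminator), i.e., those updated by the optimizer."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

n = count_trainable(nn.Linear(10, 20))  # 10*20 weights + 20 biases = 220
```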

[Figure: visual comparison on Abstract Scenes, with rows showing Ground-Truth, StoryGAN, and DUCO results for the story: 1. Mike and Jenny are playing the ball. 2. Jenny is mad that Mike is not catching the ball. 3. There is a soda sitting in the grass in front of Jenny. 4. Jenny is wearing a beanie. 5. Mike is very angry.]

Component Analysis
Table 2 shows a component analysis to evaluate the effectiveness of different components. Without either attention in the dynamic block, the performance of our model degrades, and the worst performance is obtained when the model does not use the entire dynamic block. This demonstrates that (1) each attention mode is important in the dynamic block, and (2) simply fusing different-modal representations without considering their semantics fails to comprehensively improve the performance.
Besides, we further consider the effectiveness of the dynamic block in the generator and discriminator. The worse performance of "Ours w/o DB in D" shows that the dynamic block plays a more important role in the discriminator. We think this is because the discriminator needs to provide training feedback to the generator in terms of image quality and text-image alignment. If the discriminator fails to match the semantics between text and image representations, the training feedback may be less precise, which may hinder the generator from generating high-quality story images. The degraded performance of "Ours w/o DB in G" arises because the generator cannot effectively capture the correlation between text and image representations, even when precise and fine-grained training feedback is provided by the discriminator. This demonstrates the complementary effect of adopting the dynamic block in both the generator and the discriminator.

Human Evaluation
Similarly to (Li et al., 2022a), a human evaluation on Pororo-SV is conducted based on three evaluation criteria: (1) visual quality, (2) text-image semantic alignment, and (3) consistency across story images. We asked workers to decide which sample is the best, where each sample contains story images and the corresponding sentences. Each of 100 randomly selected samples is assigned to three workers to reduce human variance. As shown in Table 3, workers prefer the results generated by our approach.

Conclusion
In this paper, we explored the semantic misalignment problem existing in current story visualization methods, and proposed dynamic interactions via learning to dynamically explore various semantic depths and fuse the different-modal information at a matched semantic level. Experiments demonstrate the superior performance of our proposed single-GAN-based approach, which has fewer parameters and uses neither segmentation masks nor auxiliary captioning networks.

Limitations
Our method may fail to generate high-quality results when a given multi-sentence story is complex, e.g., when it describes multiple characters (e.g., > 3) against various backgrounds. Additionally, similarly to current methods, our approach focuses on generating short image sequences, which means that when the number of images in a story is larger (e.g., > 15), our method may fail to ensure consistency between the different story images.

Figure 1 :
Figure 1: Architecture of the proposed method (left) and dynamic block (right).

Table 1 :
Quantitative comparison between different methods on Pororo-SV and Abstract Scenes. For FID, FSD, and the number of parameters, lower is better. For Cosine, higher is better.

Table 2 :
Ablation study on Pororo-SV. "Ours w/o both Attn" denotes not using either attention in the dynamic block (DB); "Ours w/ SA Only" denotes only adopting the self-attention in DB; "Ours w/ WSA Only" denotes only adopting the word-level spatial attention in DB; and "w/o DB in G (or D)" denotes not using DB in the generator (or discriminator).

Table 3 :
Human evaluation on Pororo-SV between VLC, DUCO, and Ours based on three criteria.