Sound of Story: Multi-modal Storytelling with Audio

Storytelling is multi-modal in the real world. When one tells a story, one may use visualizations and sounds along with the story itself. However, prior studies on storytelling datasets and tasks have paid little attention to sound, even though sound also conveys meaningful semantics of the story. Therefore, we propose to extend the story understanding and telling areas by establishing a new component called "background sound", which is story context-based audio without any linguistic information. For this purpose, we introduce a new dataset, called "Sound of Story (SoS)", which has paired image and text sequences with corresponding sound or background music for a story. To the best of our knowledge, this is the largest well-curated dataset for storytelling with sound. Our SoS dataset consists of 27,354 stories with 19.6 images per story on average and 984 hours of speech-decoupled audio such as background music and other sounds. As benchmark tasks for storytelling with sound on this dataset, we propose retrieval tasks between modalities and audio generation tasks from image-text sequences, and introduce strong baselines for them. We believe the proposed dataset and tasks may shed light on the multi-modal understanding of storytelling in terms of sound. The dataset and baseline code for each task will be released at: https://github.com/Sosdatasets/SoS_Dataset.


Introduction
Story understanding and telling is one of the important but challenging problems in natural language processing and machine learning, and there have been extensive studies on this problem for decades (Harrison et al., 2017; Kočiskỳ et al., 2018). Much prior research has focused on building models that understand stories, such as utilizing pre-trained models (Chen et al., 2022; Wang et al., 2023; He et al., 2023), event-centric understanding (Chen et al., 2021), and using a multiway sampler (Xu et al., 2023). Many approaches have also been proposed that aim to understand the stories within the provided text (Mostafazadeh et al., 2016; Fan et al., 2018; Akoury et al., 2020), images (Huang et al., 2016; Bensaid et al., 2021; Krojer et al., 2022), and videos (Rohrbach et al., 2015; Xu et al., 2016; Tapaswi et al., 2016; Li et al., 2019; Bain et al., 2020; Huang et al., 2020).
Although there are many benchmarks for story understanding and telling, most of them pay little attention to audio, even though it is one of the key elements for humans to understand stories (Lang et al., 1998; Lake et al., 2017). Therefore, we propose to extend the story understanding and telling areas by establishing a new component called background sound, which is story context-based audio without any linguistic information. For this purpose, we introduce a novel dataset called Sound of Story (SoS), which aims to leverage audio information for story understanding, distinguishing it from existing story or audio datasets.
To create the SoS dataset, all modalities are extracted from the Condensed Movie Dataset (CMD) (Bain et al., 2020) and the Large Scale Movie Description Challenge dataset (LSMDC) (Rohrbach et al., 2015). Image and text data extracted from the same movie clip share a context, and both make it possible to infer the story of that clip. Audio such as background sound and music can enrich storytelling by representing and conveying the context of situations (Eronen et al., 2005; Stowell et al., 2015). Due to these characteristics, the audio, images, and text extracted from the same movie clip are correlated and complementary, so our SoS dataset allows the three modalities generated from the same movie clip to be used as a paired resource for story understanding. In addition, we propose new story understanding benchmarks and their baseline models by introducing audio retrieval and audio generation tasks using the SoS dataset.
These background sound tasks can be utilized in various applications. For example, for movies and cartoons, background sound is often produced for a designated scene using various props or objects to create a sense of reality. For the retrieval task, among the available sound sources, it is possible to recommend the one that best fits the scene. Furthermore, background sound generation can produce more realistic audio while reducing human resource costs by directly generating the right audio for a given scene. The overall data generation process is shown in Figure 1.
In summary, our contributions are as follows:
• We propose to extend the story understanding and telling areas by establishing a new component called background sound, which is story context-based audio without any linguistic information and has not been well explored. For this purpose, we introduce a new dataset named Sound of Story (SoS), consisting of text and image sequences and audio for a story. To the best of our knowledge, this is the largest well-curated dataset for storytelling with sound.
• As benchmark tasks to show the significance and relevance of sound in a story, we introduce retrieval tasks between audio and other modalities and background sound generation tasks, together with baseline models for each of them.
• We will release the dataset and the code to generate it so that the community can build more data for a variety of tasks that utilize sound in storytelling and story understanding research.

Related Work
Story understanding and telling Storytelling tasks pose significant challenges in natural language processing (NLP), and researchers have proposed various datasets and benchmarks to tackle them. Wikiplots, WritingPrompts (Fan et al., 2018), and ROCStories (Mostafazadeh et al., 2016) introduce datasets that revolve around text-based story plots. Wikiplots comprises a collection of approximately 113k story plots extracted from English Wikipedia. WritingPrompts (Fan et al., 2018) assembled pairs of human-written stories and corresponding prompts from online sources. ROCStories (Mostafazadeh et al., 2016) employed Amazon Mechanical Turk (AMT) to generate short stories that incorporate common-sense knowledge, with a focus on everyday topics. These datasets are usually used for story understanding or story generation tasks.
In addition to text-based story plots, numerous story datasets incorporate images or videos. Visual Storytelling (Huang et al., 2016) introduces the first sequential vision-to-language dataset and explores its applicability in various tasks. It goes beyond simple object understanding or captioning for concrete scenes by considering input that encompasses temporal events. Similarly, the ImageCoDe dataset (Krojer et al., 2022) also contains stories but is designed to enable models to learn detailed situations from finer visual differences within similar sets of images. Movie data inherently encompasses storytelling. Datasets such as MovieQA (Tapaswi et al., 2016) and MovieNet (Huang et al., 2020) have been collected from movie data and used in various multimodal domains. Additionally, datasets like CMD (Bain et al., 2020) and LSMDC (Rohrbach et al., 2015) not only provide movie videos but also include descriptions for the corresponding clips, offering researchers an opportunity to explore models from a multimodal perspective. However, there are few datasets that pair stories with sound. Existing audio benchmarks mostly consist of music (Bertin-Mahieux et al., 2011; Sturm, 2012; Ferraro et al., 2021) or animal/environmental sounds (Salamon et al., 2014; Piczak, 2015; Gemmeke et al., 2017; Kim et al., 2019), and are designed to evaluate simple matching performance between objects (or sentiment) and audio input.
Multi-modal retrieval tasks Multi-modal retrieval models are trained to find similar features across different modalities by fusing them. These models must understand both the context of a given input and the interplay between features provided by other modalities. As a result, retrieval is a highly challenging and crucial task for comprehending the distinctive characteristics of multiple modalities.

Multi-modal audio generation
There have been prior studies on generating audio conditioned on information from other modalities (Vasquez and Lewis, 2019; Yu et al., 2022; Borsos et al., 2022; Neves et al., 2022; Schneider et al., 2023; Yang et al., 2023; Agostinelli et al., 2023). Among audio generation tasks, text-to-audio generation has recently been actively studied. AudioGen (Kreuk et al., 2022) uses an augmentation technique that mixes different audio samples when generating audio from captions. Riffusion (Forsgren and Martiros, 2022) fine-tunes Stable Diffusion (Rombach et al., 2022) to generate spectrogram images from text. MUSICGEN (Copet et al., 2023) uses a single-stage language model with efficient token interleaving patterns and an unsupervised melody conditioning method. In this paper, we are interested in generating relevant audio given a story.

Table 1: Statistics compared to different storytelling datasets. Among the audio types, "open" denotes raw audio from the video and "speech-decoupled audio" denotes audio with speech excluded. Among the text types, "Caption" denotes text that explains only the image, and "Description" denotes text that includes the story.

Dataset
We propose a novel dataset named the SoS dataset with 27,354 audio-image-text pairs: 22,330 pairs are created from CMD (Bain et al., 2020) and 5,024 from LSMDC (Rohrbach et al., 2015). We extract the images and audio from the same video and pair them with the corresponding description so that the images and text share the same storyline. The audio is related to the other modalities because it is extracted from the same clip, so it can be used in a complementary way. Examples from the SoS dataset are shown in Figure 2.

Movie Clips
CMD (Bain et al., 2020) provides key clips and their high-level semantic descriptions from 3,605 movies. Each key clip has a duration of around 150 seconds and can be obtained automatically from YouTube, making it easily accessible. LSMDC (Rohrbach et al., 2015) consists of 118,081 short clips extracted from 202 movies. Each LSMDC clip has a duration of around 3 to 12 seconds and is accompanied by captions generated from Descriptive Video Services (DVS). In our work, we merge these two movie clip datasets to create the audio-image-text paired dataset. To bridge the gap between CMD and LSMDC, we adapt the LSMDC dataset to match the format of CMD. Since the short movie clips in LSMDC are contiguous, we concatenate them in sequential order until their combined length is similar to that of a CMD clip; in practice, around 20 short movie clips are concatenated to create one raw movie clip.
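The clip-merging rule above can be sketched as follows. The ~150-second target comes from the text; the function name and grouping-by-index representation are our own illustration, not the authors' code.

```python
# Sketch: group contiguous LSMDC short clips (3-12 s each) into chunks
# whose total duration is roughly a CMD clip's ~150 seconds.

def merge_clips(durations, target_seconds=150.0):
    """Group consecutive clip durations into chunks of ~target_seconds.

    durations: clip lengths in seconds, in sequential order.
    Returns a list of (start_index, end_index_exclusive) groups.
    """
    groups, start, total = [], 0, 0.0
    for i, d in enumerate(durations):
        total += d
        if total >= target_seconds:
            groups.append((start, i + 1))
            start, total = i + 1, 0.0
    if start < len(durations):      # keep any trailing remainder
        groups.append((start, len(durations)))
    return groups
```

With 3-12 second clips this yields on the order of 13-50 clips per group, consistent with the paper's "around 20".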

Audio processing
To facilitate audio processing, we first utilized ffmpeg (Tomar, 2006) to extract audio from the video, ensuring that the length of the extracted audio matched that of the raw movie clip video. Subsequently, we employed the bytesep framework proposed by Kong et al. (2021) to decouple speech from mixed audio containing various sounds. The resulting decoupled sounds can be regarded as audio data that share the storyline with the raw movie clip video without containing any linguistic information. Speech decoupling is performed because the speech itself carries the storyline, and a model could otherwise learn the story from the speech rather than from the non-speech audio, which is not our intention. The duration distribution of the extracted audio is shown in Figure 3.
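The ffmpeg extraction step can be sketched with a standard command; the flags used here (`-vn`, `-ac`, `-ar`) are real ffmpeg options, while the function names and the 32 kHz sample rate are our own illustrative choices. The bytesep speech-separation step is deliberately omitted, since its API is not specified in the text.

```python
# Sketch: dump a movie clip's audio track to a mono WAV with ffmpeg.
import subprocess

def build_extract_cmd(video_path, wav_path, sample_rate=32000):
    """Build an ffmpeg command that extracts the full-length audio track."""
    return ["ffmpeg", "-y", "-i", video_path,
            "-vn",                    # drop the video stream
            "-ac", "1",               # downmix to mono
            "-ar", str(sample_rate),  # resample
            wav_path]

def extract_audio(video_path, wav_path, sample_rate=32000):
    """Run the extraction; speech separation would follow on wav_path."""
    subprocess.run(build_extract_cmd(video_path, wav_path, sample_rate),
                   check=True)
```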

Image processing
Since videos include many duplicated and similar images, only the key salient images that may represent the storyline were extracted through image clustering. We conduct image clustering using image fingerprints, which are representations that uniquely identify an image based on its visual features. From all image fingerprints extracted from frames of the movie clips, we gradually removed images with a fingerprint similarity greater than 0.5, a threshold that was set empirically. Through this process, fewer frames are retained for static scenes and more frames for dynamic scenes.
The extracted image sequences can represent the entire video by capturing salient scenes with significant variation, even though they do not include all of the frames. With this approach, we can create essential image sequences of a video by focusing on scenes with meaningful changes. These processed image sequences form a novel dataset that highlights the storyline with fewer frames, differentiating it from traditional video datasets. Figure 3 shows the distribution of extracted frames for each movie clip.
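The fingerprint-based deduplication above can be sketched as follows. The 0.5 threshold comes from the text; the toy binary fingerprint (a downsample-and-binarize hash standing in for the actual fingerprinting library) and the greedy keep/drop rule are our own simplified assumptions.

```python
# Sketch: drop near-duplicate frames whose fingerprint similarity to the
# previously kept frame exceeds 0.5, so static scenes keep fewer frames.
import numpy as np

def fingerprint(frame, size=8):
    """Toy fingerprint: block-average a grayscale frame to size x size,
    then binarize against the mean (stand-in for a real fingerprint)."""
    h, w = frame.shape
    small = frame[:h - h % size, :w - w % size]
    small = small.reshape(size, h // size, size, w // size).mean(axis=(1, 3))
    return (small > small.mean()).flatten()

def similarity(fp_a, fp_b):
    """Fraction of matching fingerprint bits."""
    return float(np.mean(fp_a == fp_b))

def deduplicate(frames, threshold=0.5):
    """Keep a frame only if its similarity to the last kept frame
    is at most `threshold`; returns the kept frame indices."""
    kept, last_fp = [], None
    for i, frame in enumerate(frames):
        fp = fingerprint(frame)
        if last_fp is None or similarity(fp, last_fp) <= threshold:
            kept.append(i)
            last_fp = fp
    return kept
```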

Text processing
There are slight differences in clip lengths between CMD and LSMDC, even though both provide short descriptions for each movie clip.
Because we concatenate LSMDC movie clips to match the length of CMD movie clips, the descriptions must also be concatenated in the same numbers as the clips. As a result, the concatenated descriptions exhibit variations in sentence length, as shown in Figure 3. Furthermore, there are inherent differences between the two datasets in how the descriptions were generated: CMD uses descriptions written by YouTube users, while LSMDC employs descriptive video services. This text variation between the two datasets makes our task more challenging by requiring high-level text understanding. Figure 2 shows that the descriptions are well aligned with the other modalities.

Data Statistics and Analysis
Our SoS dataset is derived from the 3,807 movies covered in the CMD and LSMDC datasets, resulting in a total of 27,354 clips. After processing, each clip has an average of 20 images, and the total number of images in our dataset is 535,707. The accumulated length of the extracted audio is around 984 hours, and each audio lasts around 150 seconds on average. Each clip's text description consists of 88 words on average. Divided by source dataset, CMD and LSMDC clips consist of an average of 83.2 and 1,378.7 words, with medians of 80 and 130.8, respectively. A quantitative comparison with other story-based datasets is shown in Table 1; our proposed SoS dataset is the largest well-curated storytelling dataset with extracted audio.
We generate paired decoupled audio data that can be used complementarily with the images and text in the SoS dataset for story understanding and telling. The extracted decoupled audio is in-the-wild audio and contains abundant information. Specifically, we decouple the speech so that only pure audio consisting of background sound remains. The remaining audio contains a mixture of various daily-life sounds and background music. This distinctive characteristic sets it apart from other audio datasets, making it well suited for story understanding.

Retrieval Tasks
In this section, we introduce four retrieval tasks using our proposed dataset: audio-to-video, audio-to-text, text-to-audio, and video-to-audio retrieval. These tasks involve retrieving audio corresponding to an image sequence or text containing a story, or matching an image sequence or text given audio, and may be used to recommend background sound suitable for an image sequence or storyline.

Retrieval Model
In this section, we introduce the model used for our proposed retrieval tasks. Our baseline is inspired by the observation that the Attention Over Similarity Matrix method proposed by X-CLIP (Ma et al., 2022) can be flexibly applied to audio as well as to video and text. Our overall architecture is illustrated in Figure 4.
For descriptions, we use the CLIP text encoder E_T (Radford et al., 2021). The information obtained from E_T is defined as the sentence feature f_s and word features f_w. For video, the CLIP image encoder E_I is used, and a temporal encoder E_t is added to obtain temporal sequence features. The obtained features are defined as frame features f_f, and the mean pooling of the frame features is defined as the video feature f_v. For audio, audio information is extracted by embedding log Mel-filter bank coefficients and feeding them into the audio encoder E_A, a pretrained Audio Spectrogram Transformer (Gong et al., 2021). The first token of the acquired features is defined as the audio feature f_a, and the remaining tokens are defined as audio sequence features f_as.
For audio-video retrieval, we first calculate the similarities between audio and video (S_{a,v}), audio and frames (S_{a,f}), audio sequence and video (S_{as,v}), and audio sequence and frames (S_{as,f}), following Equations 1 to 4, where m is the number of video frames and n is the number of audio sequence tokens. Except for the audio-video similarity, an aggregation step is required to obtain instance-level similarities, so the audio-frame and audio sequence-video similarities are aggregated by attending over their similarity scores. For the audio sequence-frame similarity, an audio-level similarity (S'_aud) and a video-level similarity (S'_vid) are calculated, where * denotes all elements in that dimension, and S'_{as,f} is obtained as the mean of S'_aud and S'_vid from Equations 8 and 10. The final similarity between video and audio is then the average of the obtained similarities, as shown in Equation 11.
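A sketch of this similarity scheme in X-CLIP-style notation may help; this is our reconstruction under the definitions above (f_a and f_v are single vectors, f_as has n tokens, f_f has m frames), not the paper's exact equations.

```latex
% Coarse- and fine-grained similarities (Eqs. 1-4, sketched):
S_{a,v} = f_a^{\top} f_v, \qquad
(S_{a,f})_j = f_a^{\top} f_{f,j}, \qquad
(S_{as,v})_i = f_{as,i}^{\top} f_v, \qquad
(S_{as,f})_{i,j} = f_{as,i}^{\top} f_{f,j}

% Instance-level aggregation by attention over similarity scores:
S'_{a,f} = \sum_{j=1}^{m} \mathrm{softmax}_j(S_{a,f})\,(S_{a,f})_j, \qquad
S'_{as,v} = \sum_{i=1}^{n} \mathrm{softmax}_i(S_{as,v})\,(S_{as,v})_i

% For S_{as,f}: attend over frames per audio token (audio level) and over
% audio tokens per frame (video level), then average the two:
S'_{\mathrm{aud},i} = \sum_{j} \mathrm{softmax}_j\big((S_{as,f})_{i,*}\big)\,(S_{as,f})_{i,j}, \qquad
S'_{\mathrm{aud}} = \sum_{i} \mathrm{softmax}_i\big(S'_{\mathrm{aud},*}\big)\,S'_{\mathrm{aud},i}

S'_{as,f} = \tfrac{1}{2}\big(S'_{\mathrm{aud}} + S'_{\mathrm{vid}}\big), \qquad
\mathrm{Sim}_{AV} = \tfrac{1}{4}\big(S_{a,v} + S'_{a,f} + S'_{as,v} + S'_{as,f}\big)
```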
For audio-text retrieval, the similarity Sim_AT can be calculated by replacing the video-related features with text-related features.

Implementation Details
Of the 27,354 stories in our SoS dataset, 26,054 are used for training and 1,300 for testing. For video, a total of 40 images are used as input: if a clip has fewer than 40 images, the sequence is padded to 40 by duplicate sampling in temporal order, and if it has more than 40, 40 images are obtained by uniform sampling. For audio, the entire track is used for training, with random cropping applied; at test time, the same audio is used across all tests by cropping the information-rich region of the audio. For text, after tokenizing the description with the CLIP tokenizer, we keep the first 77 tokens, as in other CLIP-based models (Luo et al., 2022; Ma et al., 2022; Li et al., 2023). The batch size is 128 for audio-video and 256 for audio-text retrieval. We conducted the experiments on 2 NVIDIA A100 80GB GPUs using the PyTorch library for 100 epochs, which took about 20 hours.
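The fixed-length frame sampling described above can be sketched as follows; the target of 40 frames is from the text, while the function name and the exact index arithmetic are illustrative assumptions.

```python
# Sketch: fix every frame sequence at 40 frames, duplicating in temporal
# order when too short and sampling uniformly when too long.

def sample_frames(frames, target=40):
    """Return exactly `target` frames, preserving temporal order."""
    n = len(frames)
    if n == 0:
        raise ValueError("empty frame sequence")
    if n >= target:
        # uniform sampling across the sequence
        idx = [round(i * (n - 1) / (target - 1)) for i in range(target)]
    else:
        # duplicate sampling: output slot i maps back proportionally
        idx = [min(i * n // target, n - 1) for i in range(target)]
    return [frames[i] for i in idx]
```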
For A2V and V2A, each story in the SoS dataset consists of a fairly long image sequence obtained by clustering from videos with an average length of 215 seconds, unlike other datasets. In addition, our audio includes diverse sounds that express a story and can thus represent storylines better than audio composed of simple sounds. However, this also presents greater challenges: compared to retrieval tasks using short videos and simple audio, ours is a demanding task. Nevertheless, with R@1 of 7.615 and 6.462 for A2V and V2A, respectively, our baseline model outperformed the comparison model, Wav2CLIP, and showed reasonable results.
In addition, our text data consists of a storyline that implicitly represents the whole sequence rather than a description of a single scene, which is challenging because the model has to understand the context across scenes rather than a single scene in isolation. Even so, with R@1 of 7.462 and 7.077 for A2T and T2A, respectively, our proposed baseline outperforms the comparison model, CLAP. Retrieval results can be found in our supplementary material and GitHub link.

Audio Generation Task
Composing and arranging background music and sound effects that exquisitely match the story of the current scene is a highly challenging task that even human creators struggle with (Thom, 1999). We propose an audio generation model called SoSgen, which is designed to generate appropriate audio for a movie scene given the scene's descriptions, frames, or both as the condition.

Audio Generation Model
SoSgen takes advantage of the architecture of diffusion models (Sohl-Dickstein et al., 2015), which learn a data distribution through gradual denoising. Stable diffusion models (Rombach et al., 2022), a variant of diffusion models, enable effective conditioning by training the model on latent features. Following the same approach, we train SoSgen as a text-to-audio spectrogram diffusion model with a denoising objective in which c is a trainable CLIP text encoder and d is the SoS description.
For image+text-to-audio training, we follow the ControlNet (Zhang and Agrawala, 2023) architecture, a lightweight and powerful method that enables additional conditioning of a diffusion model while freezing the original model's parameters. We freeze the text-to-audio model's parameters and resume training with a similar objective in which an additional CLIP image encoder c_img embeds the SoS frame sequence f.
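In standard latent-diffusion notation, the two objectives referenced above can be written as follows; this is our reconstruction of the usual Stable Diffusion / ControlNet losses under the symbols defined in the text (z_t is the noised spectrogram latent at timestep t), not a verbatim copy of the paper's equations.

```latex
% Text-to-audio stage: denoise spectrogram latents conditioned on c(d).
\mathcal{L}_{\text{text}} =
  \mathbb{E}_{z,\,\epsilon \sim \mathcal{N}(0,1),\,t}
  \Big[ \big\lVert \epsilon - \epsilon_\theta\big(z_t,\, t,\, c(d)\big) \big\rVert_2^2 \Big]

% ControlNet stage: \theta is frozen; only the ControlNet branch is
% trained, with c_img(f) as the additional image condition.
\mathcal{L}_{\text{text+image}} =
  \mathbb{E}_{z,\,\epsilon \sim \mathcal{N}(0,1),\,t}
  \Big[ \big\lVert \epsilon - \epsilon_\theta\big(z_t,\, t,\, c(d),\, c_{\text{img}}(f)\big) \big\rVert_2^2 \Big]
```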

Implementation Details
Audio in movies rarely remains consistent to the end of a single scene. Moreover, using long audio as input or output of a deep learning model is burdensome in terms of hardware capacity. Thus, we slice the audio data into segments of approximately 15 seconds. Since the extracted frames partially reflect semantic changes within a scene, slicing treats the audio between two extracted frames as a base unit. Unit audios are accumulated and concatenated as long as their cumulative duration stays within 15 seconds, so each sliced audio is never longer than 15 seconds. Unit audios that are themselves longer than 15 seconds are not sliced, because the absence of significant visual change suggests a high probability of long, consistent audio. Sliced audio files containing only a little sound were eliminated using the PyDub library.
To do so, we assessed the mean frequency of each segment and excluded any with a mean frequency below 78 Hz, a threshold determined through manual inspection.
This process yielded 185,341 audio segments of roughly 15 seconds each, which were then converted into spectrograms. We randomly split these into 184,462 training and 879 test samples.
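The slicing rule above can be sketched as follows, operating on the durations of the inter-frame units. The 15-second cap and the keep-long-units-whole rule are from the text; the function name and list-of-durations representation are illustrative, and the 78 Hz loudness filter is a separate step not shown here.

```python
# Sketch: group inter-frame audio units into segments of at most ~15 s;
# a single unit longer than the cap is kept whole as its own segment.

def slice_audio(unit_durations, max_seconds=15.0):
    """Return groups of unit durations, each group one audio segment."""
    segments, current, total = [], [], 0.0
    for d in unit_durations:
        if d > max_seconds:
            # no visual change inside: keep the long unit unsliced
            if current:
                segments.append(current)
            segments.append([d])
            current, total = [], 0.0
        elif total + d > max_seconds:
            segments.append(current)
            current, total = [d], d
        else:
            current.append(d)
            total += d
    if current:
        segments.append(current)
    return segments
```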
As our base model, we leveraged Riffusion (Forsgren and Martiros, 2022), a checkpoint of Stable Diffusion v1-5 (Rombach et al., 2022) fine-tuned on pairs of music descriptions and spectrograms. Following the same approach, we fine-tuned the diffusion model from the Riffusion checkpoint on pairs of SoS descriptions and the corresponding sliced audio spectrograms. In addition, we further trained the model with both image and text conditioning, utilizing the ControlNet (Zhang and Agrawala, 2023) architecture and using the sequence of clustered frames corresponding to the sliced audio as the image condition. SoSgen_text+image was trained from the SoSgen_text checkpoint, while SoSgen_image resumed from the Riffusion checkpoint and used a redundant text condition in both training and testing. The specific model architecture is shown in Figure 5.
For training, we used a batch size of 1, 16 gradient accumulation steps, an initial learning rate of 1e-6, and a cosine learning rate scheduler. We fine-tuned the text-to-audio model for 406,000 steps and trained an additional 525,000 steps for the text+image-to-audio model. Separately, the image-to-audio model was trained for 525,000 steps. Fine-tuning ran for 2 days and ControlNet training for another 2 days on a single NVIDIA GeForce RTX 3090 GPU.

Results and Analysis
Baselines We compare SoSgen to two baselines: Riffusion (Forsgren and Martiros, 2022) and MUSICGEN (Copet et al., 2023), the state-of-the-art open-source text-to-music generation model. Performance is evaluated with the Fréchet Audio Distance (FAD) (Kilgour et al., 2019), a reference-free metric computed from the means and covariances of VGGish (Hershey et al., 2017) embeddings. We used a lightweight PyTorch implementation of FAD for more efficient evaluation.
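FAD is the Fréchet distance between Gaussians fitted to embeddings of the reference and generated audio sets. A numpy-only sketch of the distance itself is below (the VGGish embedding model is outside the scope of this snippet, and real implementations add numerical safeguards we omit):

```python
# Sketch: Frechet distance between two Gaussians fitted to embeddings,
#   d^2 = ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^{1/2}),
# using the symmetric form (cov2^{1/2} cov1 cov2^{1/2})^{1/2} for the trace.
import numpy as np

def psd_sqrt(mat):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(mat)
    w = np.clip(w, 0.0, None)        # clamp tiny negative eigenvalues
    return (v * np.sqrt(w)) @ v.T

def frechet_audio_distance(mu1, cov1, mu2, cov2):
    """Frechet distance between N(mu1, cov1) and N(mu2, cov2)."""
    diff = mu1 - mu2
    s2 = psd_sqrt(cov2)
    cross = psd_sqrt(s2 @ cov1 @ s2)
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * cross))
```

In practice mu and cov are the sample mean and covariance of VGGish embeddings over each audio set; lower FAD means the generated set is statistically closer to the reference.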

Table 3 shows the comparison of SoSgen against Riffusion and MUSICGEN. SoSgen_text, Riffusion, and MUSICGEN generate an audio spectrogram given descriptions; for SoSgen_image and SoSgen_text+image, an image sequence is also given as an additional condition.
The results indicate that SoSgen_text and SoSgen_text+image outperform Riffusion and MUSICGEN in terms of FAD. Although SoSgen_image shows only a slight improvement over Riffusion, combining the two modalities markedly enhances audio generation capability. The generated audio samples can be listened to via our GitHub link and are also included in the supplementary material; they show SoSgen's strong performance on story audio generation.

Conclusion
In this paper, we emphasize that sound also conveys meaningful semantics of a story. Thus, we propose a novel dataset named the Sound of Story (SoS) dataset, an audio-image-text pair dataset. One main feature of our dataset is that all linguistic information inside the audio is eliminated; we call audio without any linguistic information "background sound".
Background sound can be utilized in various applications, and we propose two related tasks: a retrieval task and a background sound generation task. As an example from videos such as movies and cartoons, background sound is often produced for a designated scene using various props or objects to create a sense of reality. For the retrieval task, among the available sound sources, it is possible to recommend the one that best fits the scene. Furthermore, background sound generation can produce more realistic audio while reducing human resource costs by directly generating the right audio for a given scene. To the best of our knowledge, this is the first attempt at multimodal tasks with speech-decoupled background sound relevant to the storyline.

Limitations
There are some limitations to solve in order to use audio data for story understanding. First, we create audio data by decoupling the speech to exclude linguistic information and preserve only pure audio. However, the resulting audio may still be distant from what people encounter in daily life, because various noises introduced when extracting from the movie clips may remain. Second, the audio-image-text pairing is strong but not perfect due to the nature of audio: since audio includes many subjective elements, multiple correct answers may exist for one story. To address these limitations, stricter and more careful policies for processing not only audio but also images and text for story understanding should be studied as future work.

Ethics Statement
One potential ethical concern with our proposed dataset is that our model may generate audio resembling existing work, which could pose copyright issues, for example regarding background music. In addition, generated audio may be uncomfortable or inappropriate to hear.

Figure 1 :
Figure 1: Overall data generation process for our Sound of Story (SoS) dataset. All modalities are extracted from the same movie clips. We exclude the speech from the audio so that it contains only pure audio data without linguistic information.

Figure 2 :
Figure 2: Example of the SoS dataset. CMD and LSMDC have different lengths of descriptions.

Figure 3 :
Figure 3: Distributions of each modality: (a) the number of extracted images per movie video; (b) the duration of decoupled and processed audio for each story; (c) the number of words per story from CMD; (d) the number of words per story from LSMDC. Note that due to clip concatenation to match video lengths, descriptions from LSMDC are longer. This text variation requires high-level text understanding from the model.

Figure 4 :
Figure 4: Proposed retrieval model architecture. Encoded features are obtained from the encoder corresponding to each modality and divided into context-level and sequence-level features. We then calculate the score for each task and use it to compute the final similarity Sim.

Figure 5 :
Figure 5: SoSgen architecture for training. The components and arrows drawn with dashed lines are only used in the ControlNet-involved training, i.e., image-to-audio and image+text-to-audio. To train a text-to-audio model, we condition the denoising model with CLIP text embeddings of SoS descriptions. Once the text-to-audio model is sufficiently trained, we freeze the model and add image embeddings as a condition to train the text+image-to-audio model using ControlNet. Alternatively, an image-to-audio model can be trained from the Riffusion checkpoint using a redundant text condition "Audio" for each training sample. E denotes the encoder, D the decoder, and Z_T the latent representation at timestep T.

Table 2 :
Retrieval performance on the proposed SoS dataset, measured on 1,300 test samples. Random denotes the average over five random retrievals.

Table 3 :
FAD (lower is better) of SoSgen and the baselines. SoSgen_text+image is the only model given both the image and text conditions; the rest are given only SoS descriptions as the condition.