Learning to Imagine: Visually-Augmented Natural Language Generation

People often imagine relevant scenes to aid in the writing process. In this work, we aim to utilize visual information for composition in the same manner as humans. We propose a method, LIVE, that makes pre-trained language models (PLMs) Learn to Imagine for Visually-augmented natural language gEneration. First, we imagine the scene based on the text: we use a diffusion model to synthesize high-quality images conditioned on the input text. Second, we use CLIP to determine, in a posterior way, whether the text can evoke an imagination. Finally, our imagination is dynamic: we conduct synthesis for each sentence rather than generating only one image for an entire paragraph. Technically, we propose a novel plug-and-play fusion layer to obtain visually-augmented representations for each text. Our vision-text fusion layer is compatible with Transformer-based architectures. We have conducted extensive experiments on four generation tasks using BART and T5, and the automatic results and human evaluation demonstrate the effectiveness of our proposed method. We will release the code, model, and data at https://github.com/RUCAIBox/LIVE.


Introduction
Natural language generation (NLG) is a fundamental technique for supporting a variety of downstream applications (Li et al., 2022b; Zhao et al., 2023), e.g., text summarization, story generation, and data-to-text generation. As the mainstream NLG approach, pre-trained language models (PLMs) can produce human-like text under the guidance of input conditions. Despite their success, these models are pre-trained on text-only corpora and cannot well capture visually-grounded semantics, e.g., visual commonsense (Ilharco et al., 2021), making it difficult to achieve desired results when visual knowledge is required.

To improve the generation capacity of PLMs, existing work has widely explored various methods to incorporate visual knowledge into models, which can be roughly divided into two lines of research. The first line designs specific visually-enhanced training tasks, such as continual pre-training on text-image data or knowledge distillation with vision-language models. However, these methods usually perform well only on multimodal generation tasks (e.g., visual question answering) but not text generation tasks, due to the semantic disparity across modalities (Tan and Bansal, 2020). As the second line, several studies retrieve or synthesize images related to the input and then fuse the image representations into PLMs (Wang et al., 2022b). However, they simply treat the input as a whole (even for long texts) when retrieving or synthesizing related images, which cannot sufficiently leverage fine-grained visual semantics.
Considering the above issues, we draw motivation from the process of human writing, in which writers imagine relevant scenes from the context in their minds. These visual scenes convey related experiences of the world that can inspire human writing (Bisk et al., 2020; Popham et al., 2021). Imitating this behavior, we treat NLG as a human-like writing process, in which generation is conditioned on the input text together with a set of dynamically "imagined scenes", i.e., synthesized images.
To this end, in this paper, we propose a novel approach, LIVE, that enables PLMs to Learn to Imagine for Visually-augmented natural language gEneration. Different from previous methods, our augmentation approach is relevant, selective, and dynamic. To be relevant, we utilize the state-of-the-art text-to-image model, Stable Diffusion (Rombach et al., 2022), to synthesize realistic images for fine-grained semantic units (i.e., sentences). Compared to the retrieval-based approach, our method can generate more relevant, diverse images that may not exist in real-world image databases. To be selective, we evaluate the degree to which the text's meaning can be visualized in an image and only invoke the use of synthesized images when it is actually needed. To be dynamic, we synthesize images for each sentence in the input text so that the visual knowledge is more fine-grained compared to a single image for the whole input.

[Figure 1: The overall illustration of our proposed approach LIVE, consisting of the text-related image generation and the plug-and-play vision-text fusion layer. The fusion attention mask means that the first sentence x_1 lacks visuality and will skip the fusion layer (green flow), while the second sentence x_2 has high visuality, and each word x_{2,i} of x_2 will attend to the synthesized image to obtain visually-augmented text representations (red flow).]

In order to deeply fuse the visual knowledge of synthesized images, we propose a plug-and-play vision-text fusion layer for Transformer-based models. We also design specific mechanisms to support efficient text-image cross-attention and enable controllability over the use of visual knowledge.

Our contributions are summarized as follows:

• We propose a new approach, called LIVE, to learning to use synthesized images to improve natural language generation, imitating the process of human writing.
• We propose a plug-and-play vision-text fusion layer to incorporate visual knowledge and obtain visually-augmented text representations.
• We verify the effectiveness of our approach with BART and T5 on four text generation tasks: LIVE consistently outperforms these PLMs, with an average improvement ratio of 2.44%.

Related Work
Pre-Trained Models. In recent years, large-scale pre-training has achieved remarkable success and has become the dominant technique in the NLP community (Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020; Zhao et al., 2023). Pre-trained on massive text corpora, models can learn contextualized representations that include both linguistic and world knowledge. Since PLMs are trained on pure text corpora without connection to the visual world, vision-language pre-training (VLP) leverages image-text pairs to learn cross-modal representations (Gan et al., 2022; Su et al., 2020; Radford et al., 2021). It has been discovered that VLP models possess more visual knowledge than PLMs (Ilharco et al., 2021); however, they cannot perform well on text-only tasks such as language understanding (Yun et al., 2021). In this work, we mainly focus on incorporating visual knowledge to enhance the performance of natural language generation tasks based on existing text-only models.
Visually-Augmented Language Learning. Considering the lack of visual knowledge in language models, many researchers attempt to enhance text-only tasks with visual information, which is known as visually-augmented (aided or grounded) language learning. Vokenization (Tan and Bansal, 2020) and iACE propose to treat context-related images as vokens and pre-train a text model to predict them for fusing visual information. Similarly, VidLanKD (Tang et al., 2021) extends finite image vokens to diverse video frames and employs a knowledge distillation method to acquire visual knowledge. Subsequent works leverage CLIP (Radford et al., 2021) as the vision source to integrate visual information into PLMs via CLIP output embeddings (Wang et al., 2022b; Guo et al., 2022) or knowledge transfer methods (Jin et al., 2022). The majority of these works can outperform PLMs on language understanding tasks. As for natural language generation tasks, researchers mainly focus on finding suitable images and fusing the visual representations into text-only models using a shallow module. Some works apply generation models, such as GAN-based (Long et al., 2021) and VAE-based models (Fang and Feng, 2022), to synthesize (latent) images, while Liang et al. (2021), Shen et al. (2021), and Su et al. (2022) propose to employ contextualized text embeddings to retrieve relevant images. In our work, we utilize the superior diffusion model (Rombach et al., 2022) to synthesize high-quality images and propose a plug-and-play vision-text fusion layer to deeply integrate visual knowledge into PLMs and obtain visually-augmented text representations.
Multimodal Language Generation. Multimodal language generation aims to produce fluent and coherent text based on the input text or image. Different from unimodal language generation, the additional image serves as the background for generation. Multimodal language generation includes tasks such as image captioning (Lin et al., 2014), visual question answering (Zhang et al., 2016), multimodal machine translation (Elliott et al., 2016), multimodal text summarization (Jangra et al., 2021), visual dialog (Das et al., 2017), and visual storytelling (Huang et al., 2016). However, the construction of these datasets requires costly manual annotation, which hinders their widespread application. In contrast, we do not require text-image pairs as input and instead utilize Stable Diffusion (Rombach et al., 2022), a text-to-image model, to synthesize images for input texts.

Task Formulation
Natural language generation (a.k.a., text generation) aims to capture the semantic mapping from an input text X = ⟨x_1, ..., x_k, ..., x_m⟩ to an output text Y = ⟨y_1, ..., y_k, ..., y_n⟩, where x_k and y_k denote the k-th sentences of the input and output texts, respectively. In this paper, we focus on the task of visually-augmented natural language generation (VA-NLG). Following prior works (Wang et al., 2022b), VA-NLG further assumes that text-related image data can be obtained to help text generation. Here, we consider a generalized way (e.g., retrieval and synthesis) to obtain the related images with an image augmenter F, where F takes as input a sentence x (or a text) and outputs an image i_x related to x:

i_x = F(x).    (1)

The goal of VA-NLG is to generate readable and plausible output texts Y based on input texts X and the image augmenter F, which is formally defined as:

P(Y | X, F) = ∏_k P(y_k | y_{<k}, X, F),

where y_{<k} denotes previously-generated sentences. With this formulation, there are two key issues for this task: (1) how to design the image augmenter to obtain potentially useful images, and (2) how to use the augmented images to improve text generation. Considering these two issues, we propose LIVE, a general approach to augmenting NLG tasks with related images, with sentence-level image synthesis via a text-to-image diffusion model (Section 3.2) and plug-and-play vision-text fusion for using the augmented images (Section 3.3).

Text-Related Image Generation
Although it is intuitive to augment PLMs with visual images, it is challenging to obtain appropriate and helpful images for given texts. Some previous work (Tan and Bansal, 2020) utilizes retrieval-based methods to search for images in text-image databases, such as MS COCO (Lin et al., 2014). However, these static image resources are limited in both quantity and content, which is likely to result in inaccurate image retrieval.
Synthesizing Relevant Images. To circumvent the limitation of static image resources, we instead propose to automatically generate images for given texts by leveraging text-to-image generation models. In contrast to previous works that utilize GAN-based (Esser et al., 2021) or autoregressive (Wang et al., 2022a) generation models, we use the state-of-the-art Stable Diffusion model (Rombach et al., 2022), a probabilistic diffusion model guided by CLIP-encoded input text representations, to synthesize high-quality images. With Stable Diffusion, we can flexibly perform image generation based on different text units. Here, we consider sentences as synthesis units, which contain a moderate amount of information for an image. Compared with previous work that synthesizes a single image for the whole input, our sentence-level generation is more fine-grained. This design is inspired by the writing behavior of people: one switches the imagined scenes for different sentences.
For each input sentence x_k, we apply Stable Diffusion to synthesize its corresponding creative image i_{x_k}. Equipped with the DDIM acceleration method, Stable Diffusion is able to synthesize photographic images in typically 50 steps (Rombach et al., 2022). In practice, we empirically find that a 25-step synthesis can usually lead to decent performance in our task (see Section 5.4 for more analysis of diffusion quality and efficiency).
Evaluating the Text Visuality. Although the generation-based method can flexibly produce images on various topics, not all texts can inspire the generative model to generate meaningful images; consider the rule text "ACL 2023 requires all papers to have a clear discussion of limitations". Only texts with visually rich content can be associated with images. Previous work usually synthesizes or retrieves images without considering the visuality of texts, tending to incorporate irrelevant or noisy images. However, it is difficult to directly measure the visuality of a text. As a compromise, we estimate, in a posterior way, the similarity score between a sentence x_k and its synthesized image i_{x_k} using CLIP (Radford et al., 2021):

γ = CLIP(x_k, i_{x_k}),    (2)

where CLIP is a vision-language model pre-trained on a massive amount of text-image pairs using contrastive learning, which excels at evaluating the similarity between text and image. In our work, we manually set a threshold value θ. If γ exceeds the threshold, the text is considered to have high visuality; otherwise, we consider the text to have weak visuality and discard the synthesized image. We will discuss the influence of θ in Section 5.3.
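The selective, sentence-level augmentation described above can be sketched as a small wrapper. Here `synthesize` and `clip_score` are hypothetical stand-ins (our names, not the paper's) for the Stable Diffusion call and the CLIP similarity, injected as callables so the gating logic stands on its own:

```python
from typing import Callable, List, Optional, Tuple

def augment_sentences(
    sentences: List[str],
    synthesize: Callable[[str], object],         # stand-in for Stable Diffusion
    clip_score: Callable[[str, object], float],  # stand-in for CLIP similarity
    theta: float = 0.27,                         # visuality threshold from the paper
) -> List[Tuple[str, Optional[object], float]]:
    """Synthesize one image per sentence and keep it only if the
    posterior CLIP similarity gamma exceeds the threshold theta."""
    augmented = []
    for sent in sentences:
        image = synthesize(sent)
        gamma = clip_score(sent, image)
        if gamma >= theta:
            augmented.append((sent, image, gamma))  # high visuality: keep image
        else:
            augmented.append((sent, None, 0.0))     # weak visuality: discard image
    return augmented
```

In practice both callables would wrap the (frozen) diffusion and CLIP models, and the whole loop can run offline.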

Plug-and-Play Vision-Text Fusion
After synthesizing relevant images for given texts, we study how to leverage visual images for improving text generation. Instead of using VLP models, we aim to fuse the visual knowledge into a PLM-based backbone, since text generation is essentially a language modeling task. To enhance the cross-modality fusion, we propose a plug-and-play vision-text fusion module to obtain deeply-fused visually-augmented text representations.
Vision-Text Fusion for PLMs. Our fusion module is a plug-and-play attention layer for Transformer-based (Vaswani et al., 2017) models, such as BART (Lewis et al., 2020) and T5 (Raffel et al., 2020). We insert the shared fusion layer after the self-attention layer in the encoder. Our fusion layer is a layer-wise cross-attention module that augments the word representations with visual information. In particular, for a sentence x_k and the corresponding synthesized image i_{x_k}, we first utilize CLIP to encode the image into patch representations I_k ∈ R^{p×d}. Then, we feed the sentence into the Transformer model and obtain the output representation S_{k,l} of the self-attention sub-layer in the l-th layer of the encoder. Finally, we pass S_{k,l} to our l-th plug-and-play fusion layer to obtain the visually-augmented text representations:

S̃_{k,l} = FusionLayer_l(S_{k,l}, I_k, γ),    (3)

where γ is the similarity score computed in Equation 2, and FusionLayer_l conducts multi-head attention on the query, key, and value matrices, followed by a residual connection and layer normalization. Here, we introduce γ to control whether a generated image will be used or not.
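As an illustration, the fusion step can be sketched as a single-head NumPy toy. The real layer is multi-head with learned projections; this simplification (including skipping the layer entirely when γ = 0, as low-visuality sentences do in Figure 1) is ours:

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def fusion_layer(S: np.ndarray, I: np.ndarray, gamma: float) -> np.ndarray:
    """Single-head sketch of the vision-text fusion layer.

    S: (n, d) word representations from the self-attention sub-layer.
    I: (p, d) CLIP patch representations of the synthesized image.
    gamma: CLIP similarity score gating the use of the image.
    """
    if gamma == 0.0:
        return S                                 # weak visuality: skip the fusion layer
    d = S.shape[-1]
    attn = softmax(S @ I.T / np.sqrt(d)) @ I     # text queries attend to image patches
    return layer_norm(S + gamma * attn)          # residual connection + layer norm
```

Each sentence's words act as queries over the patch representations of its own image, so the cross-attention cost scales with the sentence length times the (fixed) number of patches.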
In general, such a fusion layer can be applied to various Transformer-based PLMs and LLMs. Note that each sentence attends to at most one image, as depicted in the attention matrix in Figure 1. Compared to simply concatenating images and text as input (Liang et al., 2021), our cross-attention-based mechanism is more efficient while maintaining performance (see Section 5.2). Besides, our fusion is more controllable and can achieve fine-grained cross-attention; for example, we can choose only nouns to attend to images, since they carry more visual information (see Section 5.2).

Optimization
In order to achieve decent performance, we pre-train the key component of our approach, i.e., the fusion layer (Section 3.3), on text-image paired datasets. Specifically, we collect the image caption datasets MS COCO (Lin et al., 2014), Flickr30k (Plummer et al., 2015), CC3M (Sharma et al., 2018), and Visual Genome (Krishna et al., 2017) as text-image pairs, and utilize the caption texts to synthesize images with Stable Diffusion to enrich the pre-training pairs. In this way, we obtain 9 million text-image pairs in total. Then, we apply image-based denoising autoencoding as the pre-training objective, which teaches the model to recover the caption from a noisy text. Such a pre-training strategy helps the fusion layer better map visual knowledge into the text space.
Next, we describe the overall optimization process of our approach. During pre-training, we freeze the PLM backbone and only pre-train the fusion layer; therefore, if we unplug the fusion layer, the PLM retains its original language generation ability. The fusion layer is a lightweight module with 18M parameters for BART_BASE (140M). During fine-tuning, we utilize the Stable Diffusion and CLIP models to synthesize images and compute similarity scores. These operations can be done offline for efficiency, and the diffusion and CLIP models are not updated. We only need to fine-tune the whole PLM as usual, together with the small pre-trained fusion layer.
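The two-stage scheme above amounts to selecting different parameter subsets per stage. A minimal sketch, assuming fusion-layer parameters are identifiable by a "fusion" substring in their names (an assumption of ours, in the style of PyTorch's `named_parameters()`):

```python
def trainable_params(named_params, stage: str):
    """Select which parameters to update in each training stage.

    named_params: iterable of (name, param) pairs, as returned by
    a PyTorch model's named_parameters(); params can be any objects here.
    """
    if stage == "pretrain":
        # freeze the PLM backbone; train only the fusion layer
        return [p for n, p in named_params if "fusion" in n]
    if stage == "finetune":
        # tune everything: the backbone plus the pre-trained fusion layer
        return [p for n, p in named_params]
    raise ValueError(f"unknown stage: {stage}")
```

The selected list would then be handed to the stage's optimizer (AdamW or Adafactor), with all other parameters left frozen.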

Dataset
We conduct experiments on four text generation datasets with diverse tasks and domains:

• E2E (Novikova et al., 2017) is a data-to-text generation task with the aim of converting multiple input meaning representations into fluent texts.
• CommonGen (Lin et al., 2020) requires the model to generate a coherent and reasonable text given a collection of common concepts.
• SAMSum (Gliwa et al., 2019) is a dialogue summarization dataset that evaluates the model's summary and dialogue understanding abilities.
• ROCStories (Mostafazadeh et al., 2016) consists of five-sentence stories, and we utilize the first sentence as input to generate the remaining four.
The statistics and license of each dataset are listed in Table 1. For each dataset, we utilize NLTK to tokenize the input texts into sentences, except for the E2E dataset, where we treat each key-value pair in the input as a sentence.
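The unit splitting can be sketched as follows. The E2E pattern and the regex sentence splitter are our simplifications; the paper uses NLTK's `sent_tokenize` for the non-E2E case:

```python
import re

def split_units(text: str, is_e2e: bool = False):
    """Split an input into synthesis units.

    For E2E, each key[value] pair is a unit; otherwise a crude regex
    splitter stands in for NLTK's sent_tokenize.
    """
    if is_e2e:
        # E2E inputs look like "name[The Eagle], eatType[coffee shop], ..."
        return re.findall(r"\w+\[[^\]]*\]", text)
    # split after sentence-final punctuation followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
```

Each returned unit is then fed independently to the image augmenter.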

Evaluation Metrics
We adopt five automatic metrics, namely BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), CIDEr (Vedantam et al., 2015), SPICE (Anderson et al., 2016), and Distinct (Li et al., 2016), to compare the performance of different methods. BLEU, ROUGE, and CIDEr compute the n-gram overlap between the candidate text and the reference text(s). SPICE further takes semantic meaning into consideration. Distinct mainly evaluates the diversity of the generated texts and is often used in open-ended generation tasks, such as story generation. We also conduct a human evaluation in Section 5.5.
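Of these metrics, Distinct is simple enough to state in a few lines: Distinct-n is the ratio of unique n-grams to total n-grams over the generated texts. A minimal sketch with whitespace tokenization (real implementations typically use the task's tokenizer):

```python
def distinct_n(texts, n: int = 2) -> float:
    """Distinct-n: unique n-grams divided by total n-grams across texts."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

Higher values indicate more diverse generations, which is why the metric is favored for open-ended tasks such as story generation.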

Baseline Models
We utilize two commonly used text generation PLMs, BART (Lewis et al., 2020) and T5 (Raffel et al., 2020), as text-only baselines. We further compare them to two multimodal VLP models: • BLIP (Li et al., 2022a) uses a multimodal mixture of encoder-decoder with the objectives of textimage contrast, text-image matching, and language modeling on bootstrapped text-image pairs.
• OFA (Wang et al., 2022a) unifies text and image modalities using a unified architecture and multi-task sequence-to-sequence learning. In addition, we consider a variant and attempt to use OFA with only text, denoted by OFA w/o image.
We integrate our LIVE framework with BART and T5, and consider the following visually-augmented methods as comparisons: • VL-BART/VL-T5 add visual embeddings to the original BART and T5 and conduct continued pre-training on text-image pairs.
• iNLG guides the PLM with a machine-generated image as the visual prefix.

Since iNLG does not offer a T5 version, we can only combine it with BART for comparison.

Implementation Details
For all baselines, we utilize the base versions of PLMs, i.e., BART_BASE, T5_BASE, BLIP_BASE, and OFA_BASE, which have comparable numbers of parameters to ensure a fair comparison. For BLIP, OFA, VL-BART, and VL-T5, we provide the same synthesized images as for our method, and we fine-tune them in the same way as they perform VQA tasks. For iNLG, we utilize its official implementation. As for our method, we employ Stable Diffusion v1.4 with half precision to synthesize images in 25 timesteps for efficiency. Then, we adopt CLIP-ViT-B/32 to judge the similarity between text-image pairs and extract image features. We empirically set the threshold value θ = 0.27. After extraction, an MLP layer is appended to project the image representation into the text space and obtain an image representation I_i ∈ R^{50×768}. The aforementioned operations can be performed offline for efficiency.
In the pre-training stage of our fusion layer, we mask 50% of the input text with span lengths drawn from a Poisson distribution with λ = 3.5 for BART and force the model to recover the input with the image. As for T5, we split the caption into two parts and train the model to generate the second part from the first part and the image. We pre-train the fusion layer with a batch size of 384, optimizing BART using AdamW (Loshchilov and Hutter, 2019) with a constant learning rate of 1×10^-5, and T5 using Adafactor (Shazeer and Stern, 2018) with a learning rate of 1×10^-3.
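The BART-style span masking can be sketched as below. This is a deliberately simplified version (overlapping spans are allowed, and each span collapses to a single mask token, in the BART style); the Poisson sampler follows Knuth's classic algorithm:

```python
import math
import random

def poisson_sample(lam: float, rng: random.Random) -> int:
    """Knuth's algorithm for sampling from a Poisson(lam) distribution."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def mask_spans(tokens, mask_ratio=0.5, lam=3.5, mask="<mask>", seed=0):
    """Replace random spans (Poisson-distributed lengths) with a single
    mask token until roughly mask_ratio of the tokens are covered."""
    rng = random.Random(seed)
    tokens = list(tokens)
    budget = int(len(tokens) * mask_ratio)
    covered = 0
    while covered < budget and len(tokens) > 1:
        # span length of at least 1, never exceeding the remaining budget
        length = min(max(1, poisson_sample(lam, rng)), budget - covered)
        start = rng.randrange(0, len(tokens) - length + 1)
        tokens[start:start + length] = [mask]   # whole span -> one mask token
        covered += length
    return tokens
```

The fusion layer is then trained to reconstruct the original caption from the masked sequence plus the synthesized image.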
In the fine-tuning stage, we tune the entire model, including the PLM backbone and the fusion layer. We set the batch size to 32 and employ the same optimizer and learning rate as in pre-training. We optimize the model using the cross-entropy sequence-to-sequence loss with a label smoothing factor (Szegedy et al., 2016) of 0.1. During inference, we choose the checkpoint with the highest validation metric score for generation. During generation, we apply beam search with a beam size of 5 for E2E, CommonGen, and SAMSum, and nucleus sampling with p = 0.9 and temperature t = 0.7 for ROCStories.
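The nucleus (top-p) sampling used for ROCStories can be sketched as follows. `nucleus_sample` is our illustrative name, operating on an explicit per-token probability list rather than model logits; temperature is applied as p_i^{1/t}, which is equivalent to dividing the logits by t before the softmax:

```python
import random

def nucleus_sample(probs, p=0.9, temperature=0.7, rng=None):
    """Top-p (nucleus) sampling over a token probability list.

    Applies temperature, keeps the smallest prefix of probability-sorted
    tokens whose cumulative mass reaches p, renormalizes, and samples.
    """
    rng = rng or random.Random(0)
    # temperature re-scaling: p_i^(1/t), then renormalize
    scaled = [q ** (1.0 / temperature) for q in probs]
    total = sum(scaled)
    scaled = [q / total for q in scaled]
    # sort indices by probability and keep the nucleus
    order = sorted(range(len(scaled)), key=lambda i: -scaled[i])
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += scaled[i]
        if cum >= p:
            break
    # renormalize within the nucleus and sample one index
    mass = sum(scaled[i] for i in nucleus)
    r, acc = rng.random() * mass, 0.0
    for i in nucleus:
        acc += scaled[i]
        if r <= acc:
            return i
    return nucleus[-1]
```

With p = 0.9 and t = 0.7 as in our setup, low-probability tail tokens are cut off while some randomness is preserved, which suits open-ended story generation better than beam search.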
All the experiments are conducted using the text generation library TextBox (Tang et al., 2022) on NVIDIA GeForce RTX 3090 24GB GPUs using Ubuntu 20.04.1 SMP. All these hyper-parameters are identical for our method and baselines.

Experimental Results
Based on the results in Table 2, we can make the following observations. Firstly, the multimodal models (i.e., BLIP and OFA) cannot achieve satisfactory results compared with text-only models (i.e., BART and T5) on pure text tasks. This finding further confirms the semantic disparity (Tan and Bansal, 2020) across modalities in generation tasks. OFA without images even slightly outperforms OFA with images, which indicates that images may be a burden for text generation tasks when the fusion method is not appropriate.
Secondly, the visually-augmented methods (i.e., VL-BART, VL-T5, and iNLG) can achieve better performance than their base PLMs on certain tasks but fail to achieve overall improvements on all tasks. A major reason might be that they synthesize only one image for each input without considering its relevance and sentence-level semantics.
Finally, our LIVE method can outperform all baselines on all four text generation tasks. Equipping BART with our LIVE method, LIVE-BART can outperform its text-only counterpart BART by 2.80% in ratio. LIVE can also work with T5, yielding an average improvement of 2.08%. These automatic results demonstrate the effectiveness and compatibility of our text-related image generation approach and plug-and-play fusion layer.

Further Analysis
In this section, we conduct various experiments to test the efficacy of our methods. The tuning details are identical to those introduced in Section 4.1.4.

Few-Shot Results
We investigate the performance of our LIVE methods in low-resource situations. We keep 0.1%, 0.3%, 1%, and 3% of the training set of the E2E dataset. For each split, we sample five independent groups to reduce randomness. From the results in Table 3, we can observe that our methods remarkably boost performance under few-shot settings compared with the baselines, especially in extreme situations (0.1% and 0.3%). We assume that synthesized images provide visual knowledge as a supplement when training data is scarce.

Ablation Study
To examine the effectiveness of the different factors of our LIVE method, we conduct four groups of ablation experiments. The results are reported in Tables 4 and 5. First, we can see that pre-training the vision-text fusion layer is beneficial.
Second, we replace the image augmenter F, Stable Diffusion, with two variants: a text-image retriever, CLIP (Radford et al., 2021), and a text-to-image synthesizer, VQGAN (Esser et al., 2021). We find that the synthesis-based methods are superior to the retrieval-based one since they can generate relevant images that may not exist in a static database. Compared with VQGAN, Stable Diffusion can synthesize higher-quality images and provide more visual knowledge for text generation. Third, we investigate the fusion method of visual representations and create two variants of our cross-attention-based fusion. "Concatenation" concatenates the image representations and the encoder output as the input to the decoder, while "Self-attention" concatenates the image representations and the text representations as the input to the encoder. The results indicate that the deep fusion of text and vision representations is beneficial, and that the cross-attention-based and self-attention-based methods are comparable, which is consistent with Gan et al. (2022). Thus, we utilize cross-attention as the fusion method because it is more efficient and controllable.
Finally, we explore our dynamic and controllable fusion layer. To be dynamic, we synthesize one image for each sentence in the input (denoted as "Sent-level") and attempt two variants that synthesize one image for the whole document ("Doc-level") or one image for each word in the document ("Word-level"). The results prove the effectiveness of our sentence-level synthesis compared with previous methods that generate only one image for the input. However, too many images actually lead to poor performance. In addition, we investigate fine-grained cross-attention based on sentence-level synthesis ("Selective sent-level"): we only make noun words visually-augmented and let the other words skip the fusion layer. The results show that fine-grained fusion may be promising, and we leave it for future work.

Model Sensitivity w.r.t. the Similarity Threshold Value θ
In Section 3.2, we set a threshold value θ to measure the text visuality. Here, we investigate the model's performance as θ varies. If θ = 0, all sentences are visually-augmented. If θ = 1, no sentence is visually-augmented, and the model degenerates to text-only BART. As shown in Figure 2, LIVE-BART with θ = 0.27 achieves the best performance, and we find that 0.27 is close to the median of the text visuality scores, i.e., nearly half of the sentences are augmented while the others are not. Therefore, we set θ = 0.27 for our LIVE methods in experiments.

We further demonstrate that visual information is truly favorable for text generation. Following previous works, we replace the image representations with random noise or utilize the input text as a negative prompt to synthesize irrelevant images. The results in Figure 3 further prove the necessity of visual knowledge for text generation. Moreover, we vary the number of diffusion steps, since it is a trade-off between synthesis quality and efficiency. Surprisingly, increasing the diffusion steps does not lead to performance gains. We speculate that diffusion with a certain number of steps can provide enough visual knowledge for the PLM, and more steps may only help achieve higher resolution. Thus, we synthesize with only 25 steps considering efficiency.

Human Evaluation
Considering that automatic evaluation may be inconsistent with human judgments, we further invite five college students to assess the generated texts. We randomly choose 100 samples from the test set of each dataset and display the texts generated by both BART and LIVE-BART. The annotators choose which one is better, or a tie, based on their own judgment. From the results in Table 6, we observe that our LIVE method makes BART generate more satisfactory texts on all tasks.

Conclusion
In this paper, we present the LIVE method for natural language generation. First, we propose an imagination-based method that imitates the process of human writing. It is a relevant, selective, and dynamic approach that leverages Stable Diffusion to synthesize images for each input sentence and discards images with low text visuality as computed by CLIP. Furthermore, we introduce a plug-and-play vision-text fusion layer to deeply incorporate visual knowledge into PLMs and obtain visually-augmented text representations for text generation.
Extensive experiments have demonstrated that our LIVE methods are compatible with two PLMs (i.e., BART and T5) and can achieve superior performance over all the baseline models.
In future work, we will investigate how to synthesize more relevant images based on the input prompt and design a finer fusion method for better aligning different words and images. We will also attempt to extend our method to more tasks (e.g., language understanding) and PLMs (e.g., BERT). Besides, it is worthwhile to explore the possibility of combining our LIVE method with existing large language models (Zhao et al., 2023) to enhance their representation and generation capabilities.

Acknowledgment
This work was partially supported by National Natural Science Foundation of China under Grant No. 62222215, Beijing Natural Science Foundation under Grant No. 4222027, and Beijing Outstanding Young Scientist Program under Grant No. BJJWZYJH012019100020098. Xin Zhao is the corresponding author.

Limitations
We only conduct experiments on four natural language generation tasks without considering extensibility to more NLP tasks, such as language understanding or reasoning. It is also meaningful to investigate the robustness of our method across different text formats (e.g., text length and literary form), i.e., to examine in which situations and why our method achieves better performance. Due to limited computing power, we do not explore the effectiveness of our method on PLMs of various scales. Besides, we utilize CLIP to evaluate text visuality and encode images into representations; it is also interesting to research which vision encoder is most compatible with PLMs.