Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization

Multimodal abstractive summarization (MAS) models that summarize videos (vision modality) and their corresponding transcripts (text modality) are able to extract the essential information from massive multimodal data on the Internet. Recently, large-scale generative pre-trained language models (GPLMs) have been shown to be effective in text generation tasks. However, existing MAS models cannot leverage GPLMs’ powerful generation ability. To fill this research gap, we aim to study two research questions: 1) how to inject visual information into GPLMs without hurting their generation ability; and 2) where is the optimal place in GPLMs to inject the visual information? In this paper, we present a simple yet effective method to construct vision guided (VG) GPLMs for the MAS task using attention-based add-on layers to incorporate visual information while maintaining their original text generation ability. Results show that our best model significantly surpasses the prior state-of-the-art model by 5.7 ROUGE-1, 5.3 ROUGE-2, and 5.1 ROUGE-L scores on the How2 dataset, and our vision guidance method contributes 83.6% of the overall improvement. Furthermore, we conduct thorough ablation studies to analyze the effectiveness of various modality fusion methods and fusion locations.


Introduction
Multimodal abstractive summarization (MAS) aims to take advantage of data from multiple modalities and provide a short, concise and readable textual summary that lets users quickly acquire the essential information (Sanabria et al., 2018; Palaskar et al., 2019). MAS has become an increasingly popular research area thanks to the proliferation of online multimedia content and the increasing availability of multimodal data.
Figure 1: As input data, we show two representative video frames and the transcript, with [...] representing omitted unimportant text. As illustrated, some information is emphasized (e.g. the key of g flat) or only exists (e.g. piano) in the visual signal. We also compare the human-generated reference summary and our model-generated summaries with/without video frames in the input data.
As illustrated in Figure 1, MAS models need to generate a concise summary by effectively utilizing two modalities: a video and its transcript. Therefore, we emphasize that leveraging a powerful text generation model and effectively combining the vision and text modalities are key to constructing good MAS models. Recently, Transformer-based (Vaswani et al., 2017b) sequence-to-sequence (Seq2Seq) large-scale generative pre-trained language models (GPLMs), such as BART, T5 (Raffel et al., 2019), PEGASUS (Zhang et al., 2020a) and ProphetNet (Qi et al., 2020), have shown remarkable performance on text generation tasks, including abstractive text summarization. However, leveraging and adapting GPLMs to MAS is still an unexplored research direction. To explore this direction, two main questions need to be answered. Firstly, how can we inject visual information into the text-only GPLMs so that the models can understand both modalities and allow cross-modal interactions and, more importantly, how can this injection be conducted without damaging the GPLMs' original text generation ability? Secondly, where is the optimal place in GPLMs to inject the visual information? This needs to be explored, as there are many sub-layers in the encoder and decoder of GPLMs and a sub-optimal location might result in unsatisfactory performance.
In this paper, to fill the research gap, we present a simple yet very effective method to construct vision guided (VG) GPLMs (VG-BART and VG-T5) for the MAS task. Specifically, to answer the first of the aforementioned questions, we insert attention-based add-on layers into GPLMs to incorporate visual information without modifying the original architecture. In this way, all the pre-trained model weights can be used during fine-tuning so as to preserve their original text generation ability. We experiment with two types of attention mechanisms for the text-vision fusion and interaction: 1) Cross-modal Dot-product Attention; and 2) Cross-modal Multi-head Attention. Moreover, we also investigate the effects of using a forget gate and a visual transformer encoder along with the attention mechanisms. To answer the second question, we enumerate almost all possible locations in GPLMs for injecting add-on layers, and show a thorough comparison and analysis in Section 5. We evaluate our models on the How2 dataset (Sanabria et al., 2018). Experimental results demonstrate that our best model surpasses the prior state-of-the-art model by 5.7 ROUGE-1, 5.3 ROUGE-2, and 5.1 ROUGE-L scores. To ensure this improvement does not purely come from the GPLMs, we also evaluate the corresponding text-only model, and the results show that the injected visual guidance contributes 83.6% of the overall improvement, averaged over all ROUGE scores.
Our contributions in this work are threefold:
• To the best of our knowledge, we are the first to inject visual information into text-only GPLMs, and to use it for the MAS task.
• We systematically study two research questions: 1) how to inject visual information into GPLMs without hurting their generation ability; and 2) where is the optimal place in GPLMs to inject the visual information?
• Our model significantly outperforms the state-of-the-art model on the How2 dataset, and the injected visual guidance contributes 83.6% of the overall improvement.

Abstractive Text Summarization
Abstractive text summarization aims to generate short, concise and readable text that can capture the most salient information of the input documents. Thanks to the Seq2Seq framework and attention mechanisms, deep neural networks have achieved remarkable results on summarization tasks (Paulus et al., 2017; Zhang et al., 2020b; Yu et al., 2021). Recently, GPLMs (Raffel et al., 2019; Zhang et al., 2020a; Qi et al., 2020) have been widely used in abstractive text summarization and have achieved state-of-the-art performance. The most significant difference between abstractive text summarization and multimodal abstractive summarization lies in whether the input contains data of more than one modality.

Multimodal Abstractive Summarization
Recently, many studies have been performed on multimodal learning (Mroueh et al., 2015; Antol et al., 2015; Donahue et al., 2015; Zadeh et al., 2017; Dai et al., 2021). However, only a few have investigated MAS. Li et al. (2017) collected a multimodal corpus of news articles containing 500 videos of English news articles paired with human-annotated summaries. Sanabria et al. (2018) introduced the How2 dataset, which contains about 2,000 hours of short instructional videos, each coming with a summary of two to three sentences. Palaskar et al. (2019) proposed a multi-source Seq2Seq model with hierarchical attention to integrate information from different modalities into a coherent summary. Meanwhile, Liu et al. (2020) proposed a multistage fusion network with the fusion forget gate module, which can model the fine-grained interactions between multi-source modalities. To the best of our knowledge, no previous work has leveraged GPLMs' generation ability to tackle the MAS task, and we are the first to systematically study multiple multimodal fusion methods based on GPLMs.

Vision-Language Large Pre-trained Transformer Models
With the remarkable success of large-scale unsupervised pre-training in NLP (Devlin et al., 2019; Radford et al., 2019), pre-training large vision-language (VL) models has also become increasingly popular in recent years. Rather than designing task-specific architectures, pre-training yields a general backbone model by feeding it a large amount of data and then fine-tuning it on different downstream tasks. Among current VL pre-training work, most has focused on VL understanding by training BERT-style Transformer models (Sun et al., 2019; Tan and Bansal, 2019; Su et al., 2020) and fine-tuning them on various VL classification tasks (Goyal et al., 2017; Zellers et al., 2019; Suhr et al., 2019). These models usually receive a pair of text and image as input, where the image is processed into objects (Zhang et al., 2021), patches (Kim et al., 2021), or pixels before being fed into the VL model. For VL text generation, a model has been presented for both visual question answering and image captioning (Chen et al., 2015). Additionally, Cho et al. (2021) introduced an encoder-decoder Transformer model that unifies all VL tasks as generative tasks. Although prior work has made much progress on VL pre-training, the problem of generating text given text and video input (e.g., the How2 dataset) is not well studied under the VL pre-training setting, except by Luo et al. (2020), who proposed a dual-stream model for both VL classification and generation with video data. However, compared to GPLMs in NLP such as BART and T5 (Raffel et al., 2019), the text generation ability of these models is limited as their training data is much smaller.
Figure 2: An overview of our proposed VG GPLMs. It is built based on the Transformer-based Seq2Seq GPLMs (left). To inject visual information, we insert add-on sub-layers (the green dashed block) by mainly leveraging two kinds of attention-based text-vision fusion mechanism (right): 1) Cross-modal Dot-Product Attention; and 2) Cross-modal Multi-head Attention. Although we draw the add-on sub-layers in the encoder, they can also be placed in the decoder in a similar way. We compare the effects of different injection locations in Section 5.
In this paper, we propose to tackle VL tasks and utilize the advantage of pre-training from a different angle by inserting add-on layers to the text-only GPLMs and fine-tuning them on multimodal tasks to incorporate visual information. This takes advantage of GPLMs' superior generation ability to generate vision-aware texts. Of the very few works that have also considered this direction, Rahman et al. (2020) proposed the multimodal adaptation gate, which fuses data of other modalities to the textual embeddings in BERT. However, their method requires all modalities to have the same sequence length, which is rare for most datasets. Additionally, they only attempted to address the sentiment analysis task and did not explore text generation.

Vision Guided GPLMs
To take advantage of the superior text generation ability of text-only Seq2Seq GPLMs and adapt them to the MAS task, we present vision guided (VG) GPLMs. Specifically, we leverage BART and T5 (Raffel et al., 2019) to construct VG-BART and VG-T5.
In this section, we start by revisiting the text-only Seq2seq GPLMs in Section 3.1. These serve as the backbone of our proposed model and also one of the baselines. Then, we discuss the approach for extracting visual features from video clips in Section 3.2, as well as how to further process them. Finally, in Section 3.3, we introduce two types of text-vision fusion mechanism to guide the GPLMs to generate vision-aware summaries.

Overview of GPLMs for Summarization
Transformer-based (Vaswani et al., 2017b) Seq2Seq GPLMs generalize architectures like BERT (Devlin et al., 2019) and GPT (Radford et al., 2018) by including a bi-directional encoder and a unidirectional (left-to-right) decoder. An overview of this architecture is depicted on the left side of Figure 2 (except the green dashed block).
At the entry of the GPLM, the input text is first tokenized and converted to a sequence of token embeddings $X_e \in \mathbb{R}^{N \times d}$, in which $N$ is the sequence length and $d$ is the feature dimension. To retain the positional information, positional encodings (Vaswani et al., 2017a) $E_{pe} \in \mathbb{R}^{N \times d}$ are added to the token embeddings pointwise (Eq. 1), which forms the input features $Z_0$ to the encoder.
$$Z_0 = X_e + E_{pe} \tag{1}$$
As illustrated in Figure 2, the encoder is composed of a stack of $L$ encoder layers, each containing two sub-layers: 1) Multi-head Self-Attention (MSA, Eq. 2) and 2) Feed-Forward Network (FFN, Eq. 3).
$$Z'_l = \mathrm{LN}(\mathrm{MSA}(Z_{l-1}) + Z_{l-1}) \tag{2}$$
$$Z_l = \mathrm{LN}(\mathrm{FFN}(Z'_l) + Z'_l) \tag{3}$$
In addition, after each sub-layer, there is a residual connection (He et al., 2015; Wang et al., 2019) followed by a layer normalization (LN) (Ba et al., 2016). See Appendix A and B for more details of the MSA and FFN.
Similar to the encoder, the decoder also consists of a stack of decoder layers, but with two differences. Firstly, the MSA is masked to prevent positions from attending to subsequent positions (keep the decoder in a left-to-right direction). Secondly, there is one more multi-head encoder-decoder attention sub-layer, which uses the decoder embeddings to attend over the output embeddings of the encoder to incorporate the encoded information.
Specifically, in our experiments, we adopt the pre-trained BART and T5 (Raffel et al., 2019), which both follow this architecture with different training schemes. To fine-tune them on the abstractive text summarization task, the input to the encoder is the article or transcript, and the decoder learns to generate the summaries.

Video Feature Extraction
For each video clip, following previous works (Sanabria et al., 2018; Palaskar et al., 2019; Khullar and Arora, 2020), a 2048-dimensional feature representation is extracted for every 16 non-overlapping frames using a 3D ResNeXt-101 model (Hara et al., 2018), which is pre-trained on the Kinetics dataset (Kay et al., 2017). Therefore, each data sample has a sequence of 2048-dimensional vision feature vectors of length $M$. These features can be used directly as the visual input to the text-vision fusion mechanism.
In addition, in order to better model the intramodal dynamics and enhance the vision specific temporal information, we further process the extracted sequence of visual features using a Transformer (Vaswani et al., 2017a) encoder (VTF) with positional encodings. Experiments illustrate that this additional encoding process can further boost the performance of our model (Section 5).

Text-vision Fusion
As exhibited in Figure 2, we insert a third sub-layer (the green dashed block) into each encoder layer, which contains the text-vision fusion mechanism and also a residual connection followed by a layer normalization. We propose two types of text-vision fusion mechanism, as shown on the right-hand side of the figure. Given the textual input $Z_t \in \mathbb{R}^{N \times d_t}$ and visual input $Z_v \in \mathbb{R}^{M \times d_v}$, the fusion mechanism produces the vision guided output $Z'_t \in \mathbb{R}^{N \times d_t}$, which has the same dimension as the textual input and thus allows the continual stacking of layers.

Dot-product Attention Based Fusion.
Let $Z_t$ and $Z_v$ denote the textual and visual input features. Before performing dot-product attention between the textual and visual features, we first project the visual features to the same dimensional space as the textual features (Eq. 4). Then, we calculate the dot-product and apply the softmax function to get the attention score matrix $A$ (Eq. 5). Finally, the input textual features are concatenated with the attention weighted visual features and then projected by another linear transformation to output the vision guided textual features $Z'_t$ (Eq. 6).
$$Z'_v = Z_v W_1, \quad Z'_v \in \mathbb{R}^{M \times d_t} \tag{4}$$
$$A = \mathrm{Softmax}(Z_t {Z'_v}^{\top}), \quad A \in \mathbb{R}^{N \times M} \tag{5}$$
$$Z'_t = \mathrm{Concat}(Z_t, A Z_v) W_2 \tag{6}$$
Additionally, we build a variant of this fusion, which uses the linearly transformed visual features $Z'_v$ for the concatenation in Eq. 6 instead of the original $Z_v$. A comparison of their performance is shown in Section 5.
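The dot-product attention based fusion described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the authors' implementation: the names `z_t`, `z_v`, `w1`, `w2`, and `use_projected` are our own labels, the weights are random, and the toy dimensions are chosen only to keep the example small.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Row-wise softmax; subtracting the row max keeps exp() stable."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def dot_product_fusion(z_t, z_v, w1, w2, use_projected=False):
    """Fuse text features z_t (N x d_t) with visual features z_v (M x d_v)."""
    z_v_proj = matmul(z_v, w1)                              # project to d_t
    attn = softmax_rows(matmul(z_t, transpose(z_v_proj)))   # N x M scores
    # The variant concatenates the projected features instead of the originals.
    context = matmul(attn, z_v_proj if use_projected else z_v)
    concat = [t_row + c_row for t_row, c_row in zip(z_t, context)]
    return matmul(concat, w2), attn                         # back to N x d_t


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumptions for illustration only).
rng = random.Random(0)
N, M, d_t, d_v = 3, 5, 4, 6
z_t, z_v = rand_matrix(N, d_t, rng), rand_matrix(M, d_v, rng)
w1 = rand_matrix(d_v, d_t, rng)
w2 = rand_matrix(d_t + d_v, d_t, rng)
fused, attn = dot_product_fusion(z_t, z_v, w1, w2)
print(len(fused), len(fused[0]))  # output keeps the shape of z_t
```

Because the output has the same shape as the textual input, the fused features can be passed to the next layer unchanged. Note that the variant needs a second projection of shape (2 d_t) x d_t, since the concatenated visual part is already in the text dimension.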

Multi-head Attention Based Fusion.
Inspired by prior works (Yu et al., 2019; Tsai et al., 2019), we propose a vision guided multi-head attention mechanism for the text-vision fusion. The query $Q$ is linearly projected from the input textual features $Z_t$, and the key $K$ and value $V$ are linearly projected from the visual features $Z_v$ (Eq. 7-9). Then, a cross-modal multi-head attention (CMA) is applied to get the text queried visual features $O$ (Eq. 10). Finally, we obtain the vision guided output $Z'_t$ by concatenating the input textual features $Z_t$ and $O$, and linearly projecting the result to the desired dimension (Eq. 11).
$$Q = Z_t W_q \tag{7}$$
$$K = Z_v W_k \tag{8}$$
$$V = Z_v W_v \tag{9}$$
$$O = \mathrm{CMA}(Q, K, V) \tag{10}$$
$$Z'_t = \mathrm{Concat}(Z_t, O) W_3 \tag{11}$$
In addition, we also explore the effects of using a forget gate (Liu et al., 2020) in the text-vision fusion. Given the CMA output $O \in \mathbb{R}^{N \times d_t}$ in Eq. 10, we construct a forget gate mask $F \in \mathbb{R}^{N \times d_t}$ (Eq. 12) and do a point-wise multiplication with $O$ to output the updated $O'$ (Eq. 13).
$$F = \mathrm{Sigmoid}(\mathrm{Concat}(O, Z_t) W_f) \tag{12}$$
$$O' = F \otimes O \tag{13}$$
The forget gate can potentially remove redundant and noisy information from the video features, which also helps the model learn to discard needless visual information and retain its pre-trained text generation ability.
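To make the multi-head fusion and the forget gate concrete, the following is a small sketch under stated assumptions. The function and weight names (`cross_modal_attention`, `forget_gate`, `fuse`, `w_q`, `w_k`, `w_v`, `w_f`, `w_3`) are illustrative, the weights are random, and the attention omits the output projection and dropout that library implementations usually include.

```python
import math
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    # Branching avoids overflow in exp() for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def cross_modal_attention(z_t, z_v, w_q, w_k, w_v, num_heads):
    """Queries come from text; keys and values come from vision."""
    q, k, v = matmul(z_t, w_q), matmul(z_v, w_k), matmul(z_v, w_v)
    d_head = len(q[0]) // num_heads
    out = [[] for _ in z_t]
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        k_h = [row[cols] for row in k]
        v_h = [row[cols] for row in v]
        for i, q_row in enumerate(q):
            q_h = q_row[cols]
            scores = [sum(a * b for a, b in zip(q_h, k_row)) / math.sqrt(d_head)
                      for k_row in k_h]
            weights = softmax(scores)
            # Append this head's output to position i (concatenation of heads).
            out[i].extend(sum(w * v_row[j] for w, v_row in zip(weights, v_h))
                          for j in range(d_head))
    return out


def forget_gate(o, z_t, w_f):
    """Gate the attended visual features element-wise with values in (0, 1)."""
    concat = [o_row + t_row for o_row, t_row in zip(o, z_t)]
    gate = [[sigmoid(x) for x in row] for row in matmul(concat, w_f)]
    gated = [[g * x for g, x in zip(g_row, o_row)] for g_row, o_row in zip(gate, o)]
    return gated, gate


def fuse(z_t, o, w_3):
    """Concatenate text and (gated) visual features, then project to d_t."""
    return matmul([t_row + o_row for t_row, o_row in zip(z_t, o)], w_3)


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumptions for illustration only).
rng = random.Random(0)
N, M, d_t, d_v, heads = 3, 5, 4, 6, 2
z_t, z_v = rand_matrix(N, d_t, rng), rand_matrix(M, d_v, rng)
o = cross_modal_attention(z_t, z_v, rand_matrix(d_t, d_t, rng),
                          rand_matrix(d_v, d_t, rng), rand_matrix(d_v, d_t, rng), heads)
o_gated, gate = forget_gate(o, z_t, rand_matrix(2 * d_t, d_t, rng))
z_out = fuse(z_t, o_gated, rand_matrix(2 * d_t, d_t, rng))
```

Since every gate value lies strictly between 0 and 1, the gate can only shrink each attended visual feature, which matches the intuition that it filters visual information rather than amplifying it.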

Implementation Details
Data pre-processing. We pre-process the transcript data by truncating or padding it into sequences of 512 tokens after tokenization. For the videos, after the feature extraction described in Section 3.2, we also truncate or pad the sequence length to 256.
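The truncate-or-pad step can be illustrated with a short helper. This is a sketch under our own assumptions: the helper name `pad_or_truncate`, the padding id `PAD_ID = 0`, the zero-vector padding for video features, and the returned mask are all hypothetical and are not specified in the text above.

```python
import copy


def pad_or_truncate(seq, target_len, pad_value):
    """Return (sequence of exactly target_len items, 1/0 mask of real items)."""
    kept = list(seq[:target_len])
    n_pad = target_len - len(kept)
    # Copy the pad value so padded rows do not share one mutable object.
    padding = [copy.copy(pad_value) for _ in range(n_pad)]
    mask = [1] * len(kept) + [0] * n_pad
    return kept + padding, mask


PAD_ID = 0  # hypothetical padding token id

# A transcript that is too long (600 token ids) is truncated to 512.
token_ids = list(range(1, 601))
tokens, token_mask = pad_or_truncate(token_ids, 512, PAD_ID)

# A short video (40 feature vectors of size 2048) is padded to 256.
frames = [[0.1] * 2048 for _ in range(40)]
video, video_mask = pad_or_truncate(frames, 256, [0.0] * 2048)
```

The mask is a common companion to padding, as it lets attention layers ignore padded positions; whether and how padded positions are masked here is an assumption of this sketch.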
Hyper-parameters. We use BART-base and T5-base as the pre-trained GPLMs to construct VG-BART and VG-T5, in which $L = 6$ for both encoder and decoder. For the VTF mentioned in Section 3.2, we use a 4-layer encoder with 8 attention heads and a 2048 feed-forward dimension. In the decoding stage, we use beam search with a beam size of 5. The decoding process does not stop until an end-of-sequence (EOS) token is emitted or the length of the generated summary reaches 64 tokens. Following the BART paper and Raffel et al. (2019), we use learning rates 6e-4 and 3e-5 to fine-tune the pre-trained parts of the model weights. For the newly added layers, we set the learning rate to 1.5e-4. For all of our experiments, we use a batch size of 120.

Baselines
Apart from the text-only GPLMs BART and T5 (Raffel et al., 2019), we compare our proposed models with the following baselines, including simple models that only accept text input, as well as prior state-of-the-art models that accept text and vision modalities.

S2S (Luong et al., 2015). S2S is a standard Seq2seq model that uses RNNs for both encoder and decoder with a global attention mechanism (Bahdanau et al., 2014).

PG (See et al., 2017). The pointer generator (PG) network augments S2S with a copy module to reproduce key information accurately as well as to mitigate the out-of-vocabulary issue.

TF (Vaswani et al., 2017b). TF is the standard Transformer-based Seq2seq model, which introduced the multi-head attention mechanism.

MFFG (RNN/Transformer) (Liu et al., 2020). The multistage fusion with forget gate (MFFG) model proposes a cross fusion block with forget gate and a hierarchical fusion decoder to improve multimodal generation.

Footnote: https://github.com/PyTorchLightning/pytorch-lightning

Evaluation Metrics
Following prior work, we use ROUGE, BLEU, METEOR, and CIDEr to evaluate the summaries. ROUGE-{1, 2, L} (the standard metrics for abstractive summarization) (Lin and Hovy, 2003) and BLEU-{1, 2, 3, 4} (Papineni et al., 2002) are used to calculate the recall and precision of n-gram overlaps, respectively, between the references and the generated summaries. METEOR (Denkowski and Lavie, 2011) is used to match the word stems, synonyms and paraphrases between the reference and the generated summary. CIDEr is an image captioning metric that computes the cosine similarity between TF-IDF weighted n-grams.
In addition, we use Content F1 (Palaskar et al., 2019) to measure the F1 score of the content words of the generated summary based on a monolingual alignment. Firstly, the METEOR toolkit (Banerjee and Lavie, 2005; Denkowski and Lavie, 2014) is used to obtain the alignment between the summaries and references. Then, the function words and task-specific stop words are removed from the summaries and references. Finally, the remaining content words from the summaries and references are treated as two bags of words, and the F1 scores are calculated over the alignment. Content F1 focuses more on the content, and it avoids inflating the score through stop words, as can happen with ROUGE.
We use nlg-eval to compute the BLEU, METEOR and CIDEr scores, and use rouge to compute ROUGE scores. The implementation of Content F1 scores follows Palaskar et al. (2019).
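As a reminder of what an n-gram overlap metric measures, the sketch below computes a simplified ROUGE-N from scratch. It is not the `rouge` package used for the reported scores: it uses plain whitespace tokenization with no stemming, and the two example sentences are made up for illustration rather than taken from How2.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Clipped n-gram overlap between a reference and a candidate summary."""
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    overlap = sum((ref & cand).values())  # '&' keeps the minimum count per n-gram
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "learn how to play piano"
candidate = "learn to play the piano"
r1 = rouge_n(reference, candidate, n=1)  # 4 of 5 reference unigrams matched
r2 = rouge_n(reference, candidate, n=2)  # only the bigram ("to", "play") matches
```

Here the unigram recall is 4/5 while the bigram recall is 1/4, which shows how ROUGE-2 penalizes summaries that reuse the right words in a different order.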

Main Results
From Table 1, we can see that when there is only transcript in the input data, S2S and PG reach similar scores in terms of all evaluation metrics. This could be attributed to the fact that PG tends to copy the content in the transcripts while the reference summaries in the How2 dataset have a great number of novel n-grams, which are defined to be novel with respect to the transcript. We also observe that TF performs better than RNN-based models. This is because TF can learn better relationships between words through the multi-head attention mechanism and positional embeddings. Furthermore, both text-only T5 and BART outperform all the baseline models by a large margin owing to their pre-trained text generation ability. Compared to T5, BART achieves higher scores mainly because it introduces a novel pre-training objective named sentence permutation. Sentence permutation requires the model to generate the original uncorrupted text from randomly shuffled sentences, which enhances the understanding of long text and benefits the summarization task. Moreover, BART is even better than all previous multimodal models trained on transcript and video.
Footnotes: nlg-eval: https://github.com/Maluuba/nlg-eval; rouge: https://github.com/neural-dialogue-metrics/rouge
The visual guidance consistently boosts the performance of T5 and BART by a large margin. As shown in Table 2, our best model VG-BART+FG+VTF with the cross-modal multi-head attention surpasses the previous state-of-the-art model (MFFG) by 5.7 ROUGE-1, 5.3 ROUGE-2, and 5.1 ROUGE-L scores. The visual guidance contributes 83.6% of the overall improvement, averaged over all ROUGE scores.
The Content F1 scores in Table 1 show similar trends to the other evaluation metrics. By injecting visual information, the models can generate summaries with much richer content. Table 2 shows that both the forget gate (FG) and the visual transformer encoder (VTF) benefit the model's performance. However, the Content F1 score is not boosted when combining FG and VTF, which contradicts all other metrics. We conjecture that this is because Content F1 focuses more on the content aspect, so it may show some variance compared to other metrics.

How to Inject Visual Information
As illustrated in Section 3.3, we mainly adopt two text-vision fusion mechanisms to inject visual information: the cross-modal dot-product attention and the cross-modal multi-head attention. As shown in Table 1, for the VG-BART model, these two fusion mechanisms consistently improve its performance on all metrics by a comparable margin. However, for the VG-T5 model, the cross-modal dot-product attention based fusion does not show any improvement compared to the text-only T5, while the multi-head attention based fusion still increases its performance. We think there are two reasons behind this phenomenon. Firstly, as discussed in Section 5.1, BART leverages the sentence permutation method as its pre-training objective, which increases its robustness on attention-based fusion. Secondly, multi-head attention can capture different key components in the visual information from multiple aspects, which makes it more potent than the dot-product based fusion. Additionally, as mentioned in Section 3.3, we build a variant of the dot-product attention based fusion, which achieves 66.
To ensure that the visual features really help in the learning and that our add-on layers aid the understanding of them, we conduct further experiments by replacing the visual features in the input data with random noise of the same dimension and sequence length. The noise is sampled from a uniform distribution from 0 to 3, a value range similar to that of the original visual features. As depicted in Table 3, VG GPLMs with random noise as visual features achieve similar or slightly worse performance compared to the text-only GPLMs. This shows the effectiveness of our method in keeping GPLMs' text generation ability. Furthermore, compared to the dot-product attention based fusion, the multi-head fusion is better at retaining GPLMs' performance, which again demonstrates its superiority.
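The noise-replacement control can be sketched as follows. The helper name `random_noise_like` and the fixed seed are our own choices for illustration; only the uniform range from 0 to 3 and the matching shape come from the description above.

```python
import random


def random_noise_like(features, low=0.0, high=3.0, seed=None):
    """Return uniform noise with the same sequence length and dimension."""
    rng = random.Random(seed)
    return [[rng.uniform(low, high) for _ in row] for row in features]


# Toy stand-in for one video's features: 8 vectors of size 2048.
visual_features = [[0.5] * 2048 for _ in range(8)]
noise = random_noise_like(visual_features, seed=0)
```

Feeding `noise` in place of `visual_features` keeps every tensor shape unchanged, so any difference in the scores can be attributed to the content of the visual input rather than to the added parameters.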
As mentioned in Section 3, we use a forget gate (FG) to deal with the redundant and noisy information in the visual features. Additionally, we further encode the visual features with a visual transformer encoder (VTF). Table 2 shows that using either FG or VTF can increase the performance of VG-BART. Jointly leveraging them boosts the performance by 1.7, 2.0, and 1.9 of ROUGE-1, ROUGE-2, and ROUGE-L, respectively.

Where to Inject Visual Information
As discussed in Section 1, one of the main challenges of building VG GPLMs is to find the optimal location to inject the visual information (i.e., the text-vision fusion). A sub-optimal location might lead to a less effective modality fusion and even hurt the GPLMs' original text generation ability. As GPLMs have a stack of layers in the encoder and also the decoder, we explore this problem from two aspects: 1) which single layer has the best fusion effect; and 2) does fusing multiple times help GPLMs to understand the visual information better?
Table 4: Performance of different text-vision fusion locations in the encoder of our best model (VG-BART+FG+VTF with cross-modal multi-head attention). Each layer is marked to indicate whether fusion occurs at that layer. The first row is the result of BART using transcript only.
As depicted in Tables 4 and 5, firstly, we enumerate each single layer in the encoder and decoder of our best model (VG-BART+FG+VTF) to perform the text-vision fusion. In terms of ROUGE scores, we can clearly tell that injecting visual information into the encoder generally boosts the model's performance by a large margin, while injecting into the decoder only shows negligible improvement. Furthermore, in the encoder, we observe that injecting at a higher layer (closer to the encoder output) brings more improvement. In the decoder, by contrast, there is no clear pattern showing the influence of the injection location. We speculate that an early text-vision fusion in the encoder makes the visual information slightly fade away after passing through the stack of encoder layers. Additionally, during the decoding stage, the model utilizes visual information better through the encoder-decoder attention layers than through direct injection into the decoder, which could potentially hurt the generation ability. Secondly, as shown in the lower part of Table 4, we perform the fusion at multiple encoder locations. We observe that when fusing at all encoder layers simultaneously, the model converges to a much worse performance. We conjecture that this causes catastrophic forgetting of the pre-trained knowledge in GPLMs. We find that fusing at the last several layers (e.g., 5 and 6) in the encoder is able to further improve the summarization performance.

Effects of the Forget Gate
As mentioned in Section 3.3, we apply a forget gate (Eq. 12) to filter out noise and let the model focus on more important visual information. To gain a deeper understanding of the effects of the forget gate, we calculate the average forget gate score (averaged over the whole sequence) for each sample from the How2 test set. As shown in Figure 3, most scores are distributed between 0.47 and 0.48. For one data sample, the score reaches 0.5 because its transcript is not available. As illustrated in Table 6, the model can still generate a reasonable summary for it by paying more attention to the visual information. The meaning of the generated summary is still highly aligned with the reference summary, which shows the capability and flexibility of our model to utilize visual information.

Conclusion and Future Work
In this paper, we introduce a simple yet effective method to construct vision guided large-scale generative pre-trained language models (VG-BART and VG-T5) for the multimodal abstractive summarization task by inserting attention-based add-on layers. We propose two types of attention mechanisms for the text-vision fusion and interaction: 1) Cross-modal Dot-product Attention; and 2) Cross-modal Multi-head Attention. Moreover, we also investigate the effects of using the forget gate and visual transformer encoder along with the attention mechanisms. In addition, we enumerate almost all possible locations in GPLMs for injecting add-on layers. Experimental results show that our approaches significantly outperform the prior state-of-the-art on the How2 dataset. Further analysis illustrates that multi-head attention is more robust than dot-product attention and that the higher layers of the encoder are the optimal place to inject vision information. For future work, we believe that our analyses of how and where to inject visual information into GPLMs can be applied to other multimodal tasks.
Table 6:
Transcript: transcript not available
Summary from Transcript + Video: learn tips on how to write "cane" in chinese radicals with mandarin characters in the free video clip. get free foreign language lessons from an expert.
Reference Summary: learn what ticks are in chinese calligraphy in this free video clip on languages and writing.
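A Multi-head Self-Attention
Given the input $Z \in \mathbb{R}^{N \times d}$, we calculate $Q$, $K$, and $V$ by
$$Q = Z W_q, \quad K = Z W_k, \quad V = Z W_v,$$
in which $W_q \in \mathbb{R}^{d \times d_k}$, $W_k \in \mathbb{R}^{d \times d_k}$, and $W_v \in \mathbb{R}^{d \times d_v}$ are the projection weights. Then, a single-head self-attention is calculated by
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,$$
where $\frac{1}{\sqrt{d_k}}$ is the scaling factor to mitigate the extremely small gradients issue mentioned by Vaswani et al. (2017b). Multi-head self-attention is calculated by
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h), \quad \mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i),$$
where $Q_i$, $K_i$, and $V_i$ are the query, key, and value of the $i$-th head.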
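The role of the scaling factor can be seen in a tiny numerical example. The scores below are hypothetical values chosen for illustration, and `d_k = 64` is an assumed key dimension; the quantity `p * (1 - p)` is the diagonal of the softmax Jacobian, used here as a rough proxy for how much gradient can flow through the softmax.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


d_k = 64                                  # assumed key dimension
raw_scores = [0.0, 8.0, 16.0]             # hypothetical unscaled dot products
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]  # divides by 8

sharp = softmax(raw_scores)               # nearly one-hot
smooth = softmax(scaled_scores)           # noticeably flatter

# Largest diagonal entry of the softmax Jacobian, p_i * (1 - p_i).
grad_sharp = max(p * (1 - p) for p in sharp)
grad_smooth = max(p * (1 - p) for p in smooth)
```

Without scaling, almost all probability mass lands on one position and the derivative of the softmax is close to zero everywhere; after dividing by the square root of `d_k`, the distribution is flatter and the derivative is orders of magnitude larger.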

B Feed-Forward Network
Given the input $Z \in \mathbb{R}^{N \times d}$, the feed-forward network (FFN) processes it with two linear projections $W_1 \in \mathbb{R}^{d \times d_{ff}}$, $W_2 \in \mathbb{R}^{d_{ff} \times d}$ and a non-linear function GELU (Hendrycks and Gimpel, 2016):
$$\mathrm{FFN}(Z) = \mathrm{GELU}(Z W_1) W_2.$$
In addition, after each linear projection, there is a dropout (Srivastava et al., 2014) layer to improve generalization.
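A minimal sketch of such a feed-forward block is given below, assuming no bias terms and omitting dropout (which is inactive at inference time anyway). The weight values and the names `gelu`, `feed_forward`, `w1`, and `w2` are illustrative; the GELU uses its exact form based on the Gaussian error function.

```python
import math


def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def feed_forward(z, w1, w2):
    """Two linear projections with a GELU in between (biases omitted)."""
    hidden = [[gelu(x) for x in row] for row in matmul(z, w1)]
    return matmul(hidden, w2)


# Toy example: one position with d = 2, expanded to a hidden size of 3.
z = [[1.0, -2.0]]
w1 = [[0.5, -1.0, 0.0],
      [0.25, 0.5, 1.0]]
w2 = [[1.0, 0.0],
      [0.0, 1.0],
      [0.5, 0.5]]
out = feed_forward(z, w1, w2)  # same shape as z: 1 x 2
```

GELU behaves like the identity for large positive inputs and goes to zero for large negative ones, with a smooth transition in between, which is the main difference from ReLU.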