Exploiting Pseudo Image Captions for Multimodal Summarization



Introduction
With the increase of multimedia data on the Web, multimodal summarization has drawn widespread attention from researchers in the communities of Web technologies (Messaoud et al., 2021; Jangra et al., 2021a), natural language processing (NLP) (UzZaman et al., 2011; Li et al., 2017, 2020b) and computer vision (CV) (Chen and Zhuge, 2018; Palaskar et al., 2019; Li et al., 2020a; Liu et al., 2020). More recently, many efforts (Zhu et al., 2018, 2020; Zhang et al., 2021b) have been dedicated to multimodal summarization with multimodal output (MSMO), the novel task of generating pictorial summaries given a Web document consisting of plain text and a collection of images. As shown in Figure 1, a pictorial summary generated by MSMO models consists of a text summary and a salient image, delivering more user-friendly information than single-modal text summaries, according to human judgments (Zhu et al., 2018, 2020).
MSMO faces two main challenges. (1) There are no recommended image references available for training MSMO models. Due to the lack of supervision signals from the visual modality, it is nontrivial to optimize the cross-modal attention between texts and images, which existing MSMO methods rely on heavily to pick salient images. According to previous best results (Zhang et al., 2021b), only about 60% of the predicted images are correct, indicating that image selection remains a bottleneck. (2) Visual knowledge is commonly underutilized to improve text summaries. Existing MSMO efforts show no evident improvement, or even a negative impact, on text summaries (e.g., decreased ROUGE scores) over typical single-modal text summarization methods. Previous literature (Zhu et al., 2018) explained that some images were noise and that long text already contained enough information for text generation, while we conjecture that these methods may not sufficiently exploit visual knowledge to characterize salient text.
To summarize, previous efforts typically encode images and texts into the same semantic space, struggling to optimize cross-modal interaction without training signals for image selection, as the red box in Figure 1 shows. In this dilemma, image captions, which naturally connect images and texts, can provide a cross-modal semantic bridge. Indeed, our preliminary experiments show the efficacy of introducing image captions (see Section 4.4). Yet, exposing image captions during training is inconsistent with MSMO's task settings, since MSMO excludes them to pursue better generalization of cross-modal semantic modeling (Zhu et al., 2018, 2020). On the other hand, however, it inspires us to identify a highly relevant sentence for an image as its pseudo yet meaningful caption, providing us with a new perspective to improve MSMO. As shown in the blue box in Figure 1, unlike current works that represent an image as an intermediate state, we transform it into a concrete sentence to better capture visual knowledge under MSMO settings. This transformation presents an opportunity to incorporate pre-trained vision-language models more smoothly, while making further text summarization and image selection extremely simple.
Aligning a sentence with an image could be straightforward, but identifying the sentences that benefit MSMO the most is non-trivial. The reasons are two-fold. (1) A sentence well aligned with an individual image is not guaranteed to be suitable for MSMO. An intuitive way to select a sentence is to simply retrieve it from the document, with the image as the query of a pre-trained cross-modal retrieval model. Unfortunately, we find this manner yields unsatisfactory MSMO performance (see Section 4.3). (2) A classical single-pass one-to-one alignment strategy may miss salient sentences for summarization (see Section 4.2). There can be one-to-many and many-to-one relationships between images and sentences, and images can be similar within a document, so we need to synthesize yet distinguish image semantics from a global perspective to make better MSMO-oriented alignments.
To this end, we design a coarse-to-fine image-text alignment mechanism to produce pseudo image captions for MSMO. Firstly, a reference caption for an image is retrieved with a cross-modal retrieval model from the golden summary, rather than the whole document (Section 2.3), to capture more summary-friendly information. Since no golden summary exists at inference time, these reference captions are used to train a two-pass image-text alignment model (Section 2.4) that yields pseudo captions when making inferences (which is why "reference captions" are so named). Given a document with ten images, for example, we will first synthesize them as a whole to select ten sentences with many-to-many coarse-grained alignment, and then identify ten individual one-to-one matchings by bipartite graph matching over the cross-modal attention matrix.
The pseudo image captions that imply visual knowledge are used as extra highlighted features for text summarization (Section 2.5), and the salient image is picked based on the ROUGE score between its pseudo caption and the generated summary (Section 2.6). Extensive experiments on an existing MSMO dataset not only verify the superiority of our method but also reveal the inner connection between image captions and summaries, demonstrating promising research opportunities for our novel perspective of bridging the cross-modal semantic gap by generating pseudo image captions.

Problem Formulation
For the MSMO task, the input is a multimodal document {T, V} including a text document T with m sequential sentences, where T = [t_1, t_2, ..., t_m], and an image collection V with n images, where V = {v_1, v_2, ..., v_n}. The output is a multimodal summary {S, v}, where S = [s_1, s_2, ..., s_l] is a text summary containing l generated sentences and v is the image selected from V.

Method Overview
Our method, named SITA, refers to a multimodal Summarization model based on a coarse-to-fine Image-Text Alignment mechanism. SITA consists of four modules: (1) Reference Caption Retrieval, (2) Coarse-to-fine Image-Text Alignment, (3) Text Summarization, and (4) Image Selection, described in the following subsections.

Reference Caption Retrieval
Given the multimodal document {T, V}, we first retrieve reference captions from the golden text summary for each image in V, based on a pre-trained cross-modal retrieval model consisting of an image encoder and a text encoder. The image encoder is ResNet152 (He et al., 2016) pretrained on ImageNet (Deng et al., 2009), and the text encoder is a BERT-based sentence encoder for text summarization (Liu and Lapata, 2019). Following Faghri et al. (2018), we train the model on the COCO dataset (Lin et al., 2014) by matching image representations and sentence representations.
We retrieve reference image captions from the golden summary rather than the whole document, to make the retrieval results more summary-friendly and narrower-focused (see Section 4.3). However, a new dilemma is the lack of golden summaries during inference. Therefore, we exploit the reference captions to train an image-text alignment model, which predicts pseudo captions during inference.
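As a minimal sketch, the retrieval step reduces to nearest-neighbor search in the shared embedding space. The sketch below assumes the image and sentence embeddings have already been produced by the two encoders; the function and variable names are illustrative, not taken from the released code:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_reference_caption(image_vec, summary_sentence_vecs):
    """Return the index of the summary sentence closest to the image
    in the shared embedding space (one reference caption per image)."""
    scores = [cosine(image_vec, s) for s in summary_sentence_vecs]
    return max(range(len(scores)), key=scores.__getitem__)
```

At training time the candidate pool is the golden summary's sentences; at inference time no retrieval is needed, since the alignment model predicts pseudo captions directly.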

Coarse-to-fine Image-Text Alignment
We design a coarse-to-fine Image-Text Alignment model (ITA), with training signals obtained from reference captions, to generate pseudo image captions. Since there can be one-to-many and many-to-one relationships between images and sentences, employing a simple single-pass one-to-one alignment strategy tends to generate a limited set of aligned sentences repeatedly, incapable of recalling enough relevant sentences (see Section 4.2). To this end, we propose a novel two-pass coarse-to-fine mechanism to align sentences better.
Specifically, for the n images in V, we take them as a whole to select n sentences from the document T with coarse-grained alignment, and then identify one-to-one matchings via fine-grained alignment. ITA consists of an image encoder, a sentence encoder, a coarse-grained alignment module, and a fine-grained alignment module.

Image Encoder
We first use ResNet152 to extract image features for each image in {v_1, v_2, ..., v_n}. These features are then fed into a Transformer-based encoder (Vaswani et al., 2017) as a whole to synthesize global knowledge and interaction information among all images. Position embeddings are not used here since image order information is unavailable. The final output of the image encoder is a sequence of n contextualized image representations.

Sentence Encoder
The sentence encoder here is the same as the one used in reference caption retrieval. It encodes the m sentences of the document into sentence representations g_1, g_2, ..., g_m.

Coarse-grained Alignment
To perform coarse-grained alignment, we first apply cross attention between sentences and images to refine the sentence representations. Concretely, we compute a cross-modal attention matrix A ∈ R^{m×n}, whose element a_{i,j} in the i-th row and j-th column is the scaled dot-product attention weight between sentence t_i and image v_j (with scaling factor sqrt(D), where D = 768 is the dimension of the image/text feature vectors), and use each row of A to aggregate image features into the corresponding refined sentence representation ġ_i.
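A framework-free sketch of this refinement step, assuming scaled dot-product attention and a simple residual update (the exact projections and normalization inside the model may differ):

```python
from math import exp, sqrt

D = 4  # feature dimension for the sketch (768 in the paper)

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def cross_attend(sent_reps, img_reps):
    """Refine each sentence representation with scaled dot-product
    attention over all image representations.
    Returns (refined sentence reps, the m x n attention matrix A)."""
    A, refined = [], []
    for g in sent_reps:
        scores = [sum(gi * vi for gi, vi in zip(g, v)) / sqrt(D)
                  for v in img_reps]
        weights = softmax(scores)            # one row a_i of A
        ctx = [sum(w * v[d] for w, v in zip(weights, img_reps))
               for d in range(D)]
        refined.append([gd + cd for gd, cd in zip(g, ctx)])  # residual add
        A.append(weights)
    return refined, A
```

Each row of A sums to one, so the attention matrix can later be reused directly for the fine-grained matching pass.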
The refined representation ġ_i is then fed into a sigmoid classifier, parameterized by learnable W_p ∈ R^{D×D} and b ∈ R^D, to predict whether sentence t_i will be selected.
To train the model, we need n recommended sentences as references for a multimodal document with n images. For each image v_i, we calculate the ROUGE scores between the sentences in the document and its reference caption retrieved in the first step, and the sentence with the highest score is labeled as selected. If a sentence has already been selected, we pick another sentence with the next highest score. We use y_i = 1 to denote that sentence t_i is selected, and y_i = 0 otherwise. Then, for the m sentences in the document, we optimize the model with the binary cross-entropy loss L = -(1/m) Σ_{i=1}^{m} [y_i log p_i + (1 - y_i) log(1 - p_i)], where p_i is the classifier's prediction for sentence t_i.
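The labeling procedure can be sketched as follows; we substitute a simple unigram-overlap F1 for the actual ROUGE scorer to keep the sketch dependency-free:

```python
def overlap_score(sent, ref):
    """Stand-in for ROUGE between a sentence and a reference caption
    (plain unigram-overlap F1; the paper uses ROUGE)."""
    s, r = set(sent.split()), set(ref.split())
    if not s or not r:
        return 0.0
    inter = len(s & r)
    if inter == 0:
        return 0.0
    prec, rec = inter / len(s), inter / len(r)
    return 2 * prec * rec / (prec + rec)

def label_sentences(doc_sents, ref_captions):
    """For each image's reference caption, label the highest-scoring
    still-unused document sentence as selected (y_i = 1)."""
    labels = [0] * len(doc_sents)
    for ref in ref_captions:
        ranked = sorted(range(len(doc_sents)),
                        key=lambda i: overlap_score(doc_sents[i], ref),
                        reverse=True)
        for i in ranked:          # skip sentences already selected
            if labels[i] == 0:
                labels[i] = 1
                break
    return labels
```

The resulting 0/1 labels are exactly the targets y_i of the binary cross-entropy objective above.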

Fine-grained Alignment
Based on the coarse-grained alignment, we have the m × n cross-modal attention matrix A, whose element in the i-th row and j-th column is a_{i,j}. In this step, we further identify optimal one-to-one relationships between images and the selected sentences. Generally, the larger the attention weight between t_i and v_j, the more likely t_i and v_j match. Suppose we have obtained n selected sentences denoted as t_{z_1}, t_{z_2}, ..., t_{z_n}. We extract the rows corresponding to these sentences from the matrix A and concatenate them as a new n × n attention matrix Ȧ. Based on Ȧ, we can construct a complete weighted bipartite graph G containing two disjoint and independent vertex sets S and V, where |S| = n and |V| = n, so there are n × n weighted edges in G. A vertex v_i in V represents an image, and a vertex s_j in S represents a sentence. The weight of the edge in G between v_i ∈ V and s_j ∈ S is the corresponding value ȧ_{ij} in Ȧ. Therefore, the fine-grained alignment of sentences and images can be regarded as a maximum-weight perfect matching in the bipartite graph G, which we solve with a standard bipartite graph matching algorithm (the Kuhn-Munkres algorithm (Kuhn, 2010) in our implementation): M = KM(Ȧ), where M = [I_1, I_2, ..., I_n], I_i ∈ {1, 2, ..., n}, represents the index list of selected sentences (e.g., the first image is aligned with the I_1-th sentence among the selected sentences), and KM denotes the Kuhn-Munkres algorithm.
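The fine-grained step can be sketched as follows. For clarity we solve the maximum-weight perfect matching by brute force over permutations, which agrees with the Kuhn-Munkres result for the small n (images per document) involved here; a production implementation would use, e.g., scipy.optimize.linear_sum_assignment:

```python
from itertools import permutations

def max_weight_matching(A_dot):
    """Maximum-weight perfect matching on the n x n attention matrix
    (rows: selected sentences, columns: images).  Brute force over
    permutations for clarity only; the paper uses Kuhn-Munkres,
    which scales polynomially.
    Returns M with M[j] = index of the sentence aligned to image j."""
    n = len(A_dot)
    best, best_perm = float("-inf"), None
    for perm in permutations(range(n)):   # perm[j] = sentence for image j
        w = sum(A_dot[perm[j]][j] for j in range(n))
        if w > best:
            best, best_perm = w, list(perm)
    return best_perm
```

With attention rows favoring their own image, the matching recovers the diagonal assignment; off-diagonal weights flip it accordingly.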

Text Summarization
We build the text summarization module based on BERTSum, a recent simple yet robust summarization model (Liu and Lapata, 2019).

Image Selection
Given the generated summary S and the pseudo captions {t_{z_1}, t_{z_2}, ..., t_{z_n}}, the image v whose pseudo caption t_k achieves the highest ROUGE-L score with the summary S is selected as the most salient image, where k ∈ {z_1, z_2, ..., z_n} and R(t_k, S) represents the function that calculates the ROUGE-L score between t_k and S.
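A self-contained sketch of the selection rule, with a simplified sentence-level ROUGE-L (an LCS-based F-measure, without ROUGE's usual stemming and bookkeeping):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def rouge_l(cand, ref):
    """Simplified ROUGE-L F-measure between two strings."""
    c, r = cand.split(), ref.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    p, q = lcs / len(c), lcs / len(r)
    return 2 * p * q / (p + q)

def select_image(pseudo_captions, summary):
    """Pick the index of the image whose pseudo caption best
    matches the generated summary."""
    return max(range(len(pseudo_captions)),
               key=lambda k: rouge_l(pseudo_captions[k], summary))
```

Because each image carries a concrete sentence, salience scoring reduces to this purely textual comparison, with no cross-modal attention needed at selection time.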
Please refer to appendix A and our released code for more architecture and implementation details.
Experiment Settings

Dataset
We use the MSMO dataset built by Zhu et al. (2018), which is constructed from the Daily Mail website and contains 293,965 articles for training, 10,355 for validation, and 10,261 for testing. Please refer to appendix B for more dataset details.

Evaluation Metrics
Following Zhu et al. (2018, 2020), we adopt their evaluation metrics: ROUGE-{1, 2, L}, Image Precision (IP), M_sim, MR_max, and MMAE++. Meanwhile, we propose Caption-ROUGE-L, a metric specific to SITA and its variants, calculated as the ROUGE-L between a generated pseudo caption and the golden caption.

Baselines
We compare our method with five multimodal summarization methods. (1) ATG (Zhu et al., 2018) is a multimodal attention model, which measures image salience by the visual attention distribution over global image features. (2) ATL is an ATG variant using attention distributions over image patches. (3) HAN is an ATL variant that adds a hierarchical attention mechanism on image patches. (4) MOF (Zhu et al., 2020) introduces a multimodal objective function into ATG; among the four MOF variants, we choose the one with the best performance in five of the seven metrics we used. (5) UniMS (Zhang et al., 2021b) is a recent unified framework for multimodal summarization.
We also compare our method with three text summarization methods. (1) PGN (See et al., 2017) is the Pointer-Generator Network, an abstractive text summarization model. (2) BERTSum is a recent robust BERT-based summarization model proposed by Liu and Lapata (2019), upon which our SITA is built. (3) BART (Lewis et al., 2020) is a pretrained seq2seq model consisting of a bidirectional encoder and an auto-regressive decoder.

Main Results
Tables 1 and 2 show the performance of the baseline models and our method. By investigating the results, we have the following observations.
(1) Our SITA achieves improvements over baselines across all evaluation metrics of image precision, text summary quality, image-text relevance, and multimodal information integrity, clearly setting a new state of the art.
(2) Regarding the visual modality metric (IP), MOF generally outperforms its predecessor baselines by a slight margin due to its auxiliary training objective of image selection. UniMS further gains a notable improvement over MOF by distilling knowledge from a vision-language pre-trained model. Our SITA improves by more than 10% over UniMS in the precision of recommended images (e.g., 76.41 for SITA vs. 69.38 for UniMS on the IP metric). The reason is that the pseudo captions identified by our coarse-to-fine alignment mechanism provide much more informative clues for image selection. We provide more detailed analyses in the following experiments.
(3) Regarding textual modality metrics, more comprehensive comparisons are shown in Table 2, which consists of three groups of results. In the first group, existing multimodal methods (ATL and MOF) demonstrate no superiority over the single-modal text summarization model they use (PGN). These efforts concluded that too many images could bring noise, and that the long document already contained enough information for text generation (Zhu et al., 2018, 2020). In contrast, our SITA (in the second group) gains a much more remarkable improvement, e.g., of 2.18 ROUGE-L, on text summaries, even based on a more robust base model (BERTSum). The latest state-of-the-art UniMS (in the third group), built upon BART, also achieves performance improvements (e.g., +1.22 ROUGE-L) on text summarization, but not as evident as ours. Note that BART performs better than BERTSum on text summarization (e.g., 39.74 vs. 38.85 ROUGE-L), but SITA still outperforms UniMS. These results suggest that visual information actually benefits text generation, and that our method exploits it more effectively.
(4) M_sim, MR_max, and MMAE++ are used to check image-text relevance, image-text integrity, and the overall effectiveness of pictorial summaries. As expected, SITA maintains dominance over baselines on these three intermodality metrics. These superiorities come from remarkable improvements on intramodality metrics and SITA's inherent capability of bridging the cross-modal semantic gap.
Note that IP and all intermodality metrics depend on the selected salient images, hence indirectly relying on the generated text summaries. Rigorously, baseline methods and our SITA utilize different text summarization models (e.g., PGN, BART, and BERTSum), so these metrics will be more friendly to methods with a better-performing base text summarization model. However, this fact has minor impact on our analyses above, since SITA's image selection improvements mainly benefit from pseudo captions rather than from the text summaries.

Results of One-pass Alignment Strategy
To investigate how the coarse-to-fine alignment strategy boosts performance, we replace it with a single-pass alignment method, which is trained to select a pseudo caption for only one single image at a time. The results of this method variant (named One-pass) are shown in Table 3, from which we see notable performance degradation on all metrics. Through further qualitative exploration of its prediction results, we find this method tends to generate a small set of sentences repeatedly among different images, incapable of recalling enough relevant sentences. The low Caption-ROUGE-L of One-pass (e.g., 12.31) also verifies this observation. One possible reason is that images in a document can sometimes be similar, making the single-pass strategy fail to characterize the correlation and difference among these images. In contrast, by introducing the coarse-to-fine mechanism, our alignment model synthesizes multiple images from a global perspective in the coarse-grained pass, recalling more sentences more accurately and hence facilitating further fine-grained alignment.

Comparison with Simple Deduplication
To avoid recalling repeated sentences in one-pass alignment, one simple alternative strategy is to introduce a deduplication mechanism. We hence implement One-pass (Dedup), which selects another sentence with the next highest score if the current sentence has already been chosen. As shown in Table 3, the deduplication mechanism over one-pass image-text alignment brings improvements (e.g., +0.65 on R-L and +7.04 on IP). But the performance of One-pass (Dedup) is still far from our full SITA with the coarse-to-fine alignment strategy (e.g., with a significant gap of 2.4 on R-L and 12.09 on IP). The main reason is that one image may align with multiple semantically rich sentences. For such an image, even with the deduplication mechanism, one-pass alignment can only recall a single sentence, potentially missing critical information, especially when other images do not semantically overlap with it. That roughly explains the performance gaps. This comparison further verifies the necessity and soundness of the two-pass coarse-to-fine alignment design.
Figure 3: ROUGE-1 and ROUGE-L scores of simple summaries generated by simply concatenating pseudo captions (orange) or golden captions (blue) of a document's first k images. The scores are calculated by matching them against the reference summaries. The horizontal red (dashed) lines represent the text summaries generated by SITA. ROUGE-2 shows a trend similar to ROUGE-1 and is omitted for better visualization.

Effects of Cross-modal Retrieval
To investigate the effect of cross-modal retrieval, we directly retrieve pseudo captions from the document (rather than the summary), obtaining another method variant (named w/o ITA) that no longer requires image-text alignment training.
As shown in Table 3, w/o ITA brings modest enhancement on text summaries (e.g., 38.97 ROUGE-L vs. 38.85 for BERTSum), while achieving more impressive image salience (e.g., 72.28 on IP). Compared with our full SITA, this variant demonstrates significant performance degeneration on both text and image salience (e.g., -1.06 on ROUGE-L and -3.04 on IP). These results reveal that (1) the knowledge in the pre-trained cross-modal retrieval model mainly helps image selection, and the image-text alignment over retrieval results is more critical for the overall performance; and (2) retrieving reference captions from summaries instead of documents is a key design of SITA.
Note that our cross-modal retrieval model is pre-trained with 113K image-text pairs. Though UniMS distills knowledge from a vision-language model pre-trained on more than 400M image-text pairs, SITA demonstrates significant superiority. We further analyze the effectiveness of our method from the perspective of pseudo captions' quality. In particular, we are interested in the relation between golden captions and our pseudo captions. Recall that in MSMO's task settings, golden image captions are excluded.

Quality of Pseudo Captions
To perform this study, we allow the compared models to use golden captions in training, under an easier task setting. Here we build another two baselines. The first one, named Caption-train, trains the image-text alignment model with golden captions instead of the reference sentences retrieved in the first step. We compare SITA with it on the metrics of ROUGE-{1, 2, L}, IP, and Caption-ROUGE-L. Looking into the empirical results shown in Table 4, the Caption-ROUGE-L scores of SITA and Caption-train are generally similar. Hence, from the perspective of recovering image captions, the quality of aligned sentences generated by Caption-train and SITA is identical. However, SITA generates better text summaries and salient images than Caption-train (e.g., with improvements of 0.74 on ROUGE-L and 2.82 on IP), suggesting that our aligned sentences benefit MSMO more. The reason is that the reference captions used for alignment training are retrieved from text summaries, inherently making the predicted pseudo captions carry better summary features.
The second one, named Caption-input, directly utilizes golden captions instead of pseudo captions as inputs for text summarization and image selection. We find that SITA also outperforms Caption-input on all metrics. The performance enhancement is less evident but still impressive, considering that SITA uses a more restricted task setting. This observation proves that the pseudo captions we generate are even better than the original image captions for MSMO.
The above analyses verify that pseudo captions are not only semantically consistent with images but also informative for text summarization.

Correlation between Image Captions and Text Summaries
We also investigate the correlation between image captions and text summaries. Specifically, we construct a simple summary by concatenating the golden (or pseudo) captions of the first k images in a document. Then, we calculate the ROUGE scores of those simple summaries. The results are shown in Figure 3, and we have the following observations: (1) Simply aggregating some (pseudo) image captions can generate generally good summaries. For example, when selecting more than three captions, the resulting summaries even have a better ROUGE-1 than MOF. This observation verifies the inherent capability of image captions to bridge the cross-modal semantic gap.
(2) The upward trend of ROUGE-L with the increase of k is not as notable as that of ROUGE-1. The reason is that text generated by sentence concatenation (in random order) may lack coherence. ROUGE-L is calculated based on the longest common subsequence, whose length will be limited in this situation. This phenomenon suggests that an individual text summarization component is still required given these high-quality image captions.
(3) Generally, the red line is above the blue line most of the time, indicating that simple summaries constructed from pseudo captions are even better than their counterparts consisting of golden captions. This observation, again, verifies that the pseudo captions generated by our image-text alignment mechanism are more informative than the original ones in terms of improving MSMO performance.
Related Work

Multimodal summarization takes data of more than one modality as input and synthesizes information across different modalities to generate the output (UzZaman et al., 2011; Li et al., 2018; Sanabria et al., 2018; Fu et al., 2020; Im et al., 2021; Yu et al., 2021; Zhu et al., 2018, 2020; Li et al., 2020b; Jangra et al., 2020a,b, 2021b; Zhang et al., 2021a). Zhu et al. (2018) first proposed generating pictorial summaries given a document and an image collection. Zhu et al. (2020) further introduced an extra cross-entropy loss for image selection. Recently, Zhang et al. (2021b) proposed to utilize knowledge distillation with a vision-language pre-trained model to help image selection, but the image precision was still far from ideal.

Conclusion
We have presented SITA, a multimodal Summarization method based on coarse-to-fine Image-Text Alignment. SITA introduces a novel perspective of bridging the semantic gap between the visual and textual modalities by exploiting pseudo image captions. Our cross-modal alignment mechanism effectively generates high-quality pseudo image captions, enabling SITA to establish state-of-the-art performance. We discuss the feasibility and potential of leveraging pseudo image captions, and release our code, to inspire more future studies from our proposed perspective.

Limitations
By retrieving pseudo captions from summaries, one limitation is that the most relevant sentence for a specific image may not be in the summary. However, this has a trivial impact on the overall MSMO performance. If it happens, most of the time the image will not be the salient image to select, and its caption will provide no helpful information for the text summary. In this situation, selecting a pseudo caption from summary sentences will not hinder the overall performance, though it may not be the best caption for the specific image.
Besides, even though our task setting (including the dataset and all evaluation metrics we used) strictly follows three previous works (Zhu et al., 2018, 2020; Zhang et al., 2021b), another possible limitation is that only one MSMO benchmark is used (no other dataset exists). We believe providing more diversified datasets and investigating more about the rationale of the task setting are critical to pushing forward the multimodal summarization community, although they are out of the scope of this work.

A Implementation Details
We use Pytorch-Transformers to implement the BERT-base model. We use the Adam optimizer (Kingma and Ba, 2014) and set the learning rate to 0.0001. We limit the text length to 512 tokens and resize each image to a resolution of 224×224. The overall process is implemented with PyTorch (Paszke et al., 2019). We run our experiments on 2 NVIDIA V100 GPUs. The maximum number of training iterations is set to 200k, and we save a checkpoint every 2k iterations. We select the best checkpoints according to the validation loss and report results on the test set. When training the image-text alignment model, we freeze the weights of ResNet152 and use a maximum batch size of 512. When training the text summarization model, we use beam search in decoding and set the beam size to 5. The batch size is set to 512, and each input in the batch contains a text article with 512 tokens and a pseudo caption set with 128 tokens. For more implementation details, please refer to our released code on GitHub.

B Dataset Details

We use the MSMO dataset built by Zhu et al. (2018), which is the largest benchmark for this task. The dataset is constructed from the Daily Mail website and contains 293,965 articles for training, 10,355 articles for validation, and 10,261 articles for testing. Each article contains a text document and approximately seven images on average. The manually written highlights offered by Daily Mail are taken as the reference text summary. Note that pictorial summaries are only available in the test set, so there is no label information about salient images during training. Image captions are excluded from the dataset for generalization.

C Case Study
To qualitatively verify our proposed method's effectiveness, we conduct a case study on generated pseudo image captions and multimodal summaries. As illustrated in Figure 5, the pseudo captions generated by our model express image semantics appropriately. For the critical entities in the images, we can find corresponding descriptions in the high-quality pseudo captions we produce. Compared with the text summaries generated by single-modal and alternative multimodal models, SITA's output captures the article's main point better, thanks to the effective incorporation of pseudo image captions implying visual knowledge. For example, the descriptions of "A robed figure" and "M16" are missing in the text summaries of the compared models. In contrast, our SITA model generates a more accurate summary with the help of pseudo captions containing these essential facts, which also assists in identifying the salient image correctly.

D Rouge-2 of Simple Summaries
We only plot ROUGE-1 and ROUGE-L scores of simple summaries in Figure 3 for better visualization in limited space. The trend of ROUGE-2 is similar to that of ROUGE-1, as shown in Figure 4.

Figure 1 :
Figure 1: Overview of text summarization and MSMO. Compared with text summarization models, existing MSMO methods usually use an extra image encoder to project images into intermediate representations. They identify the salient image by cross-modal attention, which could be inaccurate due to the lack of golden images for training. We explicitly transform an image into a concrete caption by image-text alignment, capturing visual knowledge better and making text summarization and image selection more effective yet simpler.

Figure 2 :
Figure 2: Coarse-to-fine image-text alignment. The left part (figure a) shows an overview of the whole image-text alignment mechanism. Reference captions are first retrieved from golden summaries based on a cross-modal retrieval model. We then train an image-text alignment model with reference captions as supervision signals, identifying a relevant sentence as a pseudo caption for each image. The right part (figure b) demonstrates how our two-pass coarse-to-fine alignment model works internally.

Figure 4 :
Figure 4: ROUGE-2 scores of simple summaries generated by simply concatenating pseudo captions (red) or golden captions (blue) of a document's first k images. The scores are calculated by matching them against the reference summaries. The horizontal red dashed lines represent the text summaries generated by our SITA model.

Table 1: Main results on different metrics. R-{1, 2, L} refers to ROUGE-{1, 2, L}.

We concatenate all pseudo image captions as a new text document denoted as T_s. The original text document T and the new text document T_s are fed into the encoder of BERTSum separately, generating two representation sequences R and R_s. Then, unlike the traditional Transformer decoder, we have two individual cross attention modules, corresponding to the two documents, after the self-attention module in each Transformer block. The outputs of the two cross attention modules are simply summed, leaving other components in the Transformer block unchanged.
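The summed dual cross-attention can be sketched as follows (a single query vector with plain dot-product attention; the actual decoder uses full multi-head Transformer blocks with learned projections):

```python
from math import exp

def attn_context(q, keys):
    """Dot-product attention for one query: softmax over key scores,
    then a weighted sum of the key vectors."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    w = [exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]
    dim = len(keys[0])
    return [sum(wi * k[j] for wi, k in zip(w, keys)) for j in range(dim)]

def dual_cross_attention(dec_state, R, R_s):
    """Attend separately to the document encodings R and the
    pseudo-caption encodings R_s, then sum the two contexts,
    as in the modified Transformer block."""
    c_doc = attn_context(dec_state, R)
    c_cap = attn_context(dec_state, R_s)
    return [a + b for a, b in zip(c_doc, c_cap)]
```

Summing the two contexts lets the decoder draw on the full document and the highlighted pseudo-caption document at every step without changing the rest of the block.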

Table 2 :
Comparison of text summary quality in terms of ROUGE scores. The numbers in parentheses represent relative performance improvements of multimodal models over their single-modal counterparts (e.g., PGN, BERTSum, and BART). ↑ indicates a positive effect, and ↓ indicates a performance decrement.
Following Zhu et al. (2018, 2020), we choose the following metrics. (1) ROUGE-{1, 2, L} is the standard text summarization evaluation metric. (2) IP, the abbreviation of Image Precision, is used to evaluate image selection; it is defined by dividing the size of the intersection between the recommended images rec_img and the reference images ref_img by the number of recommended images. (3) M_sim evaluates image-text relevance by calculating the maximum similarity between the image and each sentence in the model summary. (4) MR_max evaluates the information integrity of the multimodal summary, exploiting a joint multimodal representation to calculate the similarity between model outputs and multimodal references. (5) MMAE++ evaluates the overall quality of multimodal summaries, projecting both the candidate multimodal summary and the reference summary into a joint semantic space with a trained neural network. For the details of MMAE++, please check subsection 3.3 of Zhu et al. (2018).

Table 3 :
Performance of SITA and its variants. CR-L refers to Caption-ROUGE-L. w/o ITA directly retrieves pseudo captions from the document without image-text alignment; One-pass does image-text alignment in a single-pass manner; and One-pass (Dedup) adds a sentence deduplication mechanism over One-pass.