Multimodal summarization with multimodal output (MSMO) faces a challenging semantic gap between visual and textual modalities due to the lack of reference images for training. Our pilot investigation indicates that image captions, which naturally connect texts and images, can significantly benefit MSMO. However, exposure of image captions during training is inconsistent with MSMO’s task settings, where prior cross-modal alignment information is excluded to guarantee the generalization of cross-modal semantic modeling. To this end, we propose a novel coarse-to-fine image-text alignment mechanism to identify the most relevant sentence of each image in a document, resembling the role of image captions in capturing visual knowledge and bridging the cross-modal semantic gap. Equipped with this alignment mechanism, our method easily yet impressively sets up state-of-the-art performances on all intermodality and intramodality metrics (e.g., more than 10% relative improvement on image recommendation precision). Further experiments reveal the correlation between image captions and text summaries, and prove that the pseudo image captions we generated are even better than the original ones in terms of promoting multimodal summarization.
In this paper, we reconsider the problem of (partial) false negative samples from the Mutual Information (MI) Maximization perspective, the traditional contrastive loss (like InfoNCE loss) will equally push away the anchor of all positive samples and negative samples regardless of their possible semantic similarities. We theoretically show that InfoNCE loss will not only maximize the MI between the anchor and positive samples but minimize the MI between the anchor and false negative samples even though they share similar semantic which could provide a possible theoretical explanation for the observation of the existence of false negative samples in the cross-modal contrastive learning will decrease the downstream task performance of VLP models. Above analysis motivate us to propose the VLP model with a novel Semantic Awared Contrastive Learning framework named SACL where different negative samples are assigned with different contrastive weights according to the semantic similarity between them and the anchor.
Vision Transformers (ViTs) have been widely used in large-scale Vision and Language Pre-training (VLP) models. Though previous VLP works have proved the effectiveness of ViTs, they still suffer from computational efficiency brought by the long visual sequence. To tackle this problem, in this paper, we propose an efficient vision-and-language pre-training model with Text-Relevant Image Patch Selection, namely TRIPS, which reduces the visual sequence progressively with a text-guided patch-selection layer in the visual backbone for efficient training and inference. The patch-selection layer can dynamically compute text-dependent visual attention to identify the attentive image tokens with text guidance and fuse inattentive ones in an end-to-end manner. Meanwhile, TRIPS does not introduce extra parameters to ViTs. Experimental results on a variety of popular benchmark datasets demonstrate that TRIPS gain a speedup of 40% over previous similar VLP models, yet with competitive or better downstream task performance.