SCCS: Semantics-Consistent Cross-domain Summarization via Optimal Transport Alignment

Multimedia summarization with multimodal output (MSMO) is a recently explored application in language grounding. It plays an essential role in real-world applications, e.g., automatically generating cover images and titles for news articles or providing introductions to online videos. However, existing methods extract features from the whole video and article and use fusion methods to select the representative ones, thus usually ignoring the critical structure and varying semantics within the video/document. In this work, we propose a Semantics-Consistent Cross-domain Summarization (SCCS) model based on optimal transport alignment with visual and textual segmentation. Our method first decomposes both videos and articles into segments in order to capture the structural semantics, and then follows a cross-domain alignment objective with optimal transport distance, which leverages multimodal interaction to match and select the visual and textual summary. We evaluated our method on three MSMO datasets and achieved performance improvements of 8% and 6% in textual summarization and 6.6% and 5.7% in video summarization, respectively, demonstrating the effectiveness of our method in producing high-quality multimodal summaries.


Introduction
New multimedia content in the form of short videos and corresponding text articles has become a significant trend in influential digital media. This popular media type has been shown to be successful in drawing users' attention and delivering essential information in an efficient manner. Multimedia summarization with multimodal output (MSMO) has recently drawn increasing attention. Different from traditional video or textual summarization (Gygli et al., 2014; Jadon and Jasim, 2020), where the generated summary is either a keyframe or a textual description, MSMO aims at producing both visual and textual summaries simultaneously, making this task more complicated. Previous works addressed the MSMO task by processing the whole video and the whole article together, which overlooks the structure and semantics of the different domains (Duan et al., 2022; Haopeng et al., 2022; Sah et al., 2017; Zhu et al., 2018; Mingzhe et al., 2020; Fu et al., 2021, 2020).

Figure 1: We propose a segment-level cross-domain alignment model to preserve the structural semantics consistency within two domains for MSMO. We solve an optimal transport problem to optimize the cross-domain distance, which in turn finds the optimal match.
The video and article can be regarded as being composed of several topics related to the main idea, where each topic corresponds to one specific sub-idea. Thus, treating the whole video or article uniformly and learning a general representation ignores these structural semantics and easily leads to biased summarization. To address this problem, instead of learning averaged representations for the whole video and article, we focus on exploiting the original underlying structure. The comparison of our approach and previous works is illustrated in Figure 1. Our model first decomposes the video and article into segments to discover the content structure, then explores the cross-domain semantics relationship at the segment level. We believe this is a promising approach to exploit the consistency lying in the structural semantics between different domains.
Previous models applied attention or fusion mechanisms to compute image-text relevance scores, finding the best match of the sentences/images within the whole document/video regardless of the context, which used one domain as an anchor. However, an outstanding anchor gains more weight in selecting the corresponding pair. To overcome this, we believe the semantics structure is a crucial characteristic that cannot be ignored. Based on this hypothesis, we propose Semantics-Consistent Cross-domain Summarization (SCCS), which explores segment-level cross-domain representations through Optimal Transport (OT) based multimodal alignment to generate both visual and textual summaries. We decompose the video/document into segments based on its semantic structure, then generate sub-summaries of each segment as candidates. We select the final summary from these candidates instead of via a global search, so all candidates compete in a fair arena.
Our contributions can be summarized as follows:
• We propose SCCS (Semantics-Consistent Cross-domain Summarization), a segment-level alignment model for MSMO tasks.
• Our method preserves the structural semantics and explores the cross-domain relationship through optimal transport to match and select the visual and textual summary.
• On three datasets, our method outperforms baselines in both textual and video summarization, qualitatively and quantitatively.
• Our method serves as a hierarchical MSMO framework and provides better interpretability via OT alignment. The OT coupling shows sparse patterns and a specific temporal structure for the embedding vectors of ground-truth-matched video and text segments, providing interpretable learned representations.
Since MSMO generates both visual and textual summaries, we believe the optimal summary comes from the video and text pair that is both 1) semantically consistent and 2) best matched globally in a cross-domain fashion. In addition, our framework is more computationally efficient, as it conducts cross-domain alignment at the segment level instead of taking whole videos/articles as input.

Related Work
Multimodal Alignment Aligning representations from different modalities is important in multimodal learning, and exploring the explicit relationship across vision and language has drawn significant attention (Wang et al., 2020a). Earlier works (e.g., Xu et al., 2015) generated textual summaries by taking audio, transcripts, or documents as input along with videos or images, using seq2seq models (Sutskever et al., 2014) or attention mechanisms (Bahdanau et al., 2015). The recent trend toward the MSMO task has also drawn much attention (Zhu et al., 2018; Mingzhe et al., 2020).

Methods
SCCS is a segment-level cross-domain semantics alignment model for the MSMO task, where MSMO aims at generating both visual and language summaries. We follow the problem setting in Mingzhe et al. (2020): for a multimedia source with documents and videos, the document $X_D = \{x_1, x_2, ..., x_d\}$ has $d$ words, and the ground-truth textual summary $Y_D = \{y_1, y_2, ..., y_g\}$ has $g$ words. A corresponding video $X_V$ is paired with the document, and there exists a ground-truth cover picture $Y_V$ that represents the most important information describing the video. Our SCCS model generates both textual summaries $Y'_D$ and video keyframes $Y'_V$. SCCS consists of five modules, as shown in Figure 3(a): video temporal segmentation (Section 3.1), textual segmentation (Section 3.2), visual summarization (Section 3.3), textual summarization (Section 3.4), and cross-domain alignment (Section 3.5). Each module is introduced in the following subsections.
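For clarity, the overall flow can be sketched as follows; the function names are hypothetical shorthand for the five modules, not the authors' code:

```python
def sccs_summarize(video, document):
    """Sketch of the SCCS pipeline; each helper is a hypothetical
    stand-in for the corresponding module."""
    video_segments = video_temporal_segmentation(video)            # Section 3.1
    text_segments = textual_segmentation(document)                 # Section 3.2
    keyframes = [visual_summarization(s) for s in video_segments]  # Section 3.3
    sentences = [textual_summarization(s) for s in text_segments]  # Section 3.4
    return cross_domain_alignment(keyframes, sentences)            # Section 3.5
```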

Video Temporal Segmentation
Video temporal segmentation splits the original video into small segments, upon which the summarization task builds. Segmentation is formulated as a binary classification problem on shot boundaries, similar to Rao et al. (2020). For a video $X_V$, the video segmentation encoder separates the video sequence into segments $[X_{v_1}, X_{v_2}, ..., X_{v_m}]$, where $m$ is the number of segments.
As shown in Figure 3(b), the video segmentation encoder contains a VTS module and a Bi-LSTM (Graves and Schmidhuber, 2005). Video $X_V$ is first split into shots $[S_{v_1}, S_{v_2}, ..., S_{v_n}]$ (Castellano, 2021); the VTS module then takes a clip of the video with $2\omega_b$ shots as input and outputs a boundary representation $b_i$. The boundary representation captures both the differences and the relations between the shots before and after the boundary. VTS consists of two branches, $\mathrm{VTS}_d$ and $\mathrm{VTS}_r$, whose outputs are combined into $b_i$ (Equation 1).
Here $s_i \in [0, 1]$ is the probability of a shot boundary being a scene boundary. The coarse prediction $p_{v_i} \in \{0, 1\}$ indicates whether the $i$-th shot boundary is a scene boundary, obtained by binarizing $s_i$ with a threshold $\tau$: $p_{v_i} = \mathbb{1}[s_i > \tau]$. The shot boundaries with $p_{v_i} = 1$ yield the learned video segments $[X_{v_1}, X_{v_2}, ..., X_{v_m}]$.
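The thresholding and splitting step can be sketched as follows (a minimal sketch assuming the shots and their boundary probabilities are already computed):

```python
import numpy as np

def split_into_segments(shots, scores, tau=0.5):
    """Binarize boundary scores s_i with threshold tau and split the
    shot list at predicted scene boundaries."""
    p = (np.asarray(scores) > tau).astype(int)  # coarse predictions p_i
    segments, current = [], [shots[0]]
    for shot, is_boundary in zip(shots[1:], p):
        if is_boundary:
            segments.append(current)
            current = []
        current.append(shot)
    segments.append(current)
    return segments
```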

Textual Segmentation
The textual segmentation module takes the whole document or article as input and splits it into segments based on context understanding. We use a hierarchical BERT (Lukasik et al., 2020), the current state-of-the-art method, as the textual segmentation module, as shown in Figure 3(c).

Visual Summarization
The visual summarization module generates visual keyframes from each video segment as its corresponding summary. We use an encoder-decoder architecture with attention as the visual summarization module (Ji et al., 2020), taking each video segment as input and outputting a sequence of keyframes. The encoder is a Bi-LSTM (Graves and Schmidhuber, 2005) that models the temporal relationship of video frames, where the input is $X = [x_1, x_2, ..., x_T]$ and the encoded representation is $E = [e_1, e_2, ..., e_T]$. The decoder is an LSTM (Hochreiter and Schmidhuber, 1997). To exploit the temporal ordering across the entire video, an attention mechanism is used: the attention vector at time $t$ is $E_t = \sum_{i=1}^{T} \alpha_t^i e_i$, and, similar to Hochreiter and Schmidhuber (1997), the decoder function can be written as $s_t = \psi(s_{t-1}, E_t)$, where $s_t$ is the hidden state, $\alpha_t^i$ is the attention weight between the inputs and the encoder vector, and $\psi$ is the decoder function (LSTM). To obtain $\alpha_t^i$, a relevance score $\gamma_t^i = \mathrm{score}(e_i, s_{t-1})$ is computed, where the score function measures the relationship between the $i$-th visual feature $e_i$ and the output scores at time $t$; the weights are then normalized as $\alpha_t^i = \exp(\gamma_t^i) / \sum_{j=1}^{T} \exp(\gamma_t^j)$.
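The attention computation above can be sketched as an additive-attention module; this parameterization of the score function is a common choice, not necessarily the exact one used by Ji et al. (2020):

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Computes relevance scores gamma_t^i, attention weights alpha_t^i,
    and the attention vector E_t from encoder states and the previous
    decoder hidden state."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_e = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: (T, enc_dim); dec_state: (dec_dim,)
        gamma = self.v(torch.tanh(self.W_e(enc_states) + self.W_s(dec_state)))
        alpha = torch.softmax(gamma.squeeze(-1), dim=0)   # alpha_t^i
        E_t = (alpha.unsqueeze(-1) * enc_states).sum(0)   # E_t = sum_i alpha_t^i e_i
        return E_t, alpha
```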

Textual Summarization
Language summarization produces a concise and fluent summary that preserves the critical information and overall meaning. Our textual summarization module uses BART (Lewis et al., 2020) as the summarization model to generate abstractive textual summary candidates. BART is a denoising autoencoder that maps a corrupted document back to the original document it was derived from. As in Figure 3(a), BART is an encoder-decoder Transformer pre-trained with a denoising objective on text. We take BART fine-tuned on the CNN and Daily Mail datasets for the summarization task (See et al., 2017b; Nallapati et al., 2016).
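Generating abstractive candidates per segment can be sketched with the Hugging Face transformers library, assuming the standard facebook/bart-large-cnn checkpoint (our assumption, matching the fine-tuning described here):

```python
from transformers import pipeline

# BART fine-tuned on CNN/Daily Mail.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text_segments = ["..."]  # segments from the textual segmentation module
candidates = [
    summarizer(seg, max_length=80, min_length=10)[0]["summary_text"]
    for seg in text_segments
]
```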

Cross-Domain Alignment via OT
Our cross-domain alignment (CDA) module learns the alignment between keyframes and textual summaries to generate the final multimodal summaries. The alignment module is based on OT, which has been explored in several cross-domain tasks (Chen et al., 2020a; Yuan et al., 2020; Lu et al., 2021). More background on OT can be found in Appendix A.
As shown in Figure 3(d), in CDA the image features $V = \{v_k\}_{k=1}^{K}$ are extracted from a pre-trained ResNet-101 (He et al., 2016) concatenated with Faster R-CNN features (Ren et al., 2015), as in Yuan et al. (2020), so that an image is represented as a set of detected objects, each associated with a feature vector. For text features, every word is embedded as a feature vector and processed by a Bi-GRU (Cho et al., 2014) to account for context (Yuan et al., 2020), yielding $E = \{e_m\}_{m=1}^{M}$. Following Yuan et al. (2020), we take the image and text sequence embeddings as two discrete distributions supported on the same feature representation space. Solving an OT transport plan between the two naturally constitutes a matching scheme that relates cross-domain entities (Yuan et al., 2020). To evaluate the OT distance, we compute a pairwise similarity between $V$ and $E$ using the cosine distance $C_{km} = 1 - \frac{v_k^\top e_m}{\|v_k\| \|e_m\|}$. The OT problem can then be formulated as

$D(V, E) = \min_{\mathbf{T}} \sum_{k=1}^{K} \sum_{m=1}^{M} T_{km} C_{km}, \quad \text{s.t.} \; \textstyle\sum_m T_{km} = \mu_k, \; \sum_k T_{km} = \nu_m,$

where $\mathbf{T} \in \mathbb{R}_+^{K \times M}$ is the transport matrix, and $\mu_k$ and $\nu_m$ are the weights of $v_k$ and $e_m$ in a given image and text sequence, respectively. We assume uniform weights for the features, i.e., $\mu_k = \frac{1}{K}$, $\nu_m = \frac{1}{M}$. The exact OT objective involves solving a linear program and may cause a computational burden, since it has $O(n^3)$ complexity. To address this, we add an entropic regularization term, and the objective of our optimal transport distance becomes

$D_\lambda(V, E) = \min_{\mathbf{T}} \sum_{k,m} T_{km} C_{km} + \lambda H(\mathbf{T}),$

where $H(\mathbf{T}) = \sum_{k,m} T_{km} \log T_{km}$ is the (negative) entropy term and $\lambda$ is the hyperparameter that balances its effect. This allows us to apply the celebrated Sinkhorn algorithm (Cuturi, 2013) to efficiently solve the above equation in $O(n \log n)$. The optimal transport distance computed via the Sinkhorn algorithm is differentiable and can be implemented with the POT library (Flamary et al., 2021). The procedure is shown in Algorithm 1, where $\beta$ is a hyperparameter, $\mathbf{C}$ is the cost matrix, $\odot$ is the Hadamard product, $\langle \cdot, \cdot \rangle$ is the Frobenius dot-product, matrices are in bold, and the rest are scalars.
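A minimal sketch of this computation with the POT library (Flamary et al., 2021); the feature shapes and uniform weights follow the description above, while the regularization strength is an assumed placeholder:

```python
import numpy as np
import ot  # POT: Python Optimal Transport (Flamary et al., 2021)

def ot_distance(V, E, reg=0.1):
    """Entropic OT distance between image features V (K, d) and text
    features E (M, d), with uniform weights and cosine cost."""
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    C = 1.0 - V @ E.T                       # cosine cost matrix C_km
    mu = np.full(len(V), 1.0 / len(V))      # uniform image weights
    nu = np.full(len(E), 1.0 / len(E))      # uniform text weights
    T = ot.sinkhorn(mu, nu, C, reg)         # transport plan via Sinkhorn
    return float((T * C).sum()), T
```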

Multimodal Summaries
With the trained alignment module, the Wasserstein distance (WD) between each keyframe-sentence pair among all the visual and textual summary candidates is computed, and the best-matched pair is selected as the final multimodal summary.
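The selection can be sketched as below, reusing the ot_distance helper above; features() is a hypothetical feature extractor standing in for the ResNet/Bi-GRU encoders of the alignment module:

```python
def select_summary(keyframe_candidates, sentence_candidates, features):
    """Return the keyframe-sentence pair with the smallest OT/Wasserstein
    distance; `features` maps an item to its (N, d) feature matrix."""
    return min(
        ((kf, s) for kf in keyframe_candidates for s in sentence_candidates),
        key=lambda pair: ot_distance(features(pair[0]), features(pair[1]))[0],
    )
```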

Datasets
We evaluated our models on three datasets: the VMSMO, Daily Mail, and CNN datasets from Mingzhe et al. (2020) and Fu et al. (2021, 2020). The VMSMO dataset contains 184,920 samples, including articles and corresponding videos. Each sample is assigned a textual summary and a video with a cover picture. We adopted the available data samples from Mingzhe et al. (2020). The Daily Mail dataset contains 1,970 samples, and the CNN dataset contains 203 samples, which include video titles, images, and captions, similar to Hermann et al. (2015). For data splitting, we use the same experimental setup as Mingzhe et al. (2020) for the VMSMO dataset. For the Daily Mail and CNN datasets, we split the data 70%/10%/20% into train, validation, and test sets, respectively, the same as Fu et al. (2021, 2020).

Experimental Setting and Implementation
For the VTS module, we used the same model settings as Rao et al. (2020) and Castellano (2021) and the same data splitting as Mingzhe et al. (2020) and Fu et al. (2021, 2020) in the training process.
The visual summarization model is pre-trained on the TVSum (Song et al., 2015) and SumMe (Gygli et al., 2014) datasets. The TVSum dataset contains 50 edited videos downloaded from YouTube in 10 categories, and the SumMe dataset consists of 25 raw videos recording various events. Frame-level importance scores are provided for each video in both datasets and are used as ground-truth labels. The input visual features are extracted from GoogLeNet pre-trained on ImageNet, where the output of the pool5 layer is used as the visual feature.
For the textual segmentation module, due to the quadratic computational cost of Transformers, we limit BERT's input to 64 word pieces per sentence and 128 sentences per document, as in Lukasik et al. (2020). We use 12 layers for both the sentence and article encoders, for a total of 24 layers. To be able to use the BERT BASE checkpoint, we use 12 attention heads and 768-dimensional word-piece embeddings. The hierarchical BERT model is pre-trained on the Wiki-727K dataset (Koshorek et al., 2018), which contains 727 thousand articles from a snapshot of the English Wikipedia. We used the same data splitting as Koshorek et al. (2018).
For textual summarization, we adopted the pre-trained BART model from Lewis et al. (2020), which has a hidden size of 1024 and 406M parameters and has been fine-tuned on the CNN and Daily Mail datasets.
In the cross-domain alignment module, the feature extraction and alignment components are pre-trained on the MS COCO dataset (Lin et al., 2014) for the image-text matching task. We add the OT loss as a regularization term to the original matching loss to align the image and text more explicitly.
For the VMSMO dataset, the quality of the chosen cover frame is evaluated by mean average precision (MAP) and recall at position k ($R_n@k$) (Zhou et al., 2018c; Tao et al., 2019), where $R_n@k$ measures whether the positive sample is ranked in the top $k$ positions among $n$ candidates. For the Daily Mail and CNN datasets, we calculate the cosine image similarity (Cos) between the image references and the extracted frames (Fu et al., 2021, 2020).
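The Cos metric reduces to a cosine similarity between feature vectors; a minimal sketch assuming features pre-extracted from a CNN:

```python
import numpy as np

def cosine_image_similarity(ref_feat, frame_feat):
    """Cos metric between a reference image feature and an extracted
    frame feature (both 1-D numpy arrays)."""
    return float(ref_feat @ frame_feat /
                 (np.linalg.norm(ref_feat) * np.linalg.norm(frame_feat)))
```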

Results and Discussion
The comparison results on the VMSMO dataset for multimodal, video, and textual summarization are shown in Table 1. Synergistic and PSAC are pure video summarization approaches, and they did not perform as well as multimodal methods such as MOF or DIMS, which suggests that taking an additional modality into consideration helps to improve the quality of the generated video summaries. Table 1 also shows the absolute performance improvement or decrease compared with the MSMO baseline, where improvements are marked in red and decreases in blue. Overall, our method shows a higher absolute performance improvement than previous methods on both textual and video summarization. Our method preserves the structural semantics and learns the alignment between keyframes and textual descriptions, which leads to better performance than previous approaches. Comparing the quality of generated textual summaries, our method also outperforms the other multimodal baselines (MSMO, MOF, DIMS) and traditional textual summarization methods (Lead, TextRank, PG, Unified, and GPG), showing that the alignment obtained by optimal transport can help to identify the cross-domain inter-relationships.
In Table 2, we show comparison results with multimodal baselines on the Daily Mail and CNN datasets. On the CNN dataset, our method achieves competitive results with Img+Trans, TFN, HNNattTI, and M2SM on the quality of generated textual summaries, while on the Daily Mail dataset, our approach performs better on both textual and visual summaries. We also compare with traditional pure video summarization and pure textual summarization baselines on the Daily Mail dataset, with results shown in Table 2. Our approach is competitive with NN-SE and M2SM on the quality of the generated textual summary, while the visual summaries generated by our approach outperform the other visual summarization baselines. We also provide an absolute performance comparison with the baseline MSMO (Zhu et al., 2018): as shown in Table 2, our model achieves the highest performance improvement on both the Daily Mail and CNN datasets compared with previous baselines. Comparing the quality of generated textual summaries with language model (LM) baselines, our method also outperforms T5, Pegasus, and BART.

Human Evaluation
For human evaluation, we asked 5 annotators (recruited from the institute) to score the results generated by different approaches on the CNN and Daily Mail datasets. The judges scored the results of 5 models (MSMO, TFN, HNNattTI, M2SM, and SCCS) from 1 to 5, where 5 represents the best result. We averaged the scores from the 5 human judges. The performance of the 5 models is listed in Table 3, showing that the results of SCCS are better than the baselines.

Factual Consistency Evaluation
Factual consistency is another important criterion for evaluating summarization results (Honovich et al., 2022). For factual consistency, we adopted the method of Xie et al. (2021) and followed the same setting. The same human annotators from Section 5.4 provided human judgments. We report the Pearson correlation coefficient (Coe_P). The results of MSMO, Img+Trans, TFN, HNNattTI, M2SM, and our method are shown in Table 4. In summary, our method outperforms the baselines on factual consistency evaluation.

Ablation Study
To evaluate each component's contribution, we performed ablation experiments across modalities and datasets. For the VMSMO dataset, we compare the performance of using only visual information, only textual information, and multimodal information; the comparison is shown in Table 1. We also carried out experiments on different modalities using the Daily Mail dataset to show the performance of the unimodal and multimodal components, with results shown in Table 2. In the ablations, when only textual data is available, we adopt BERT (Devlin et al., 2019) to generate text embeddings and K-Means clustering to identify the sentences closest to the centroids as the textual summary. When only video data is available, we solve visual summarization in an unsupervised manner: we cluster frames with K-Means on image histograms and then select the best frame from each cluster based on the variance of the Laplacian, as sketched below.
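A minimal sketch of this video-only baseline, assuming OpenCV BGR frames; the histogram binning and cluster count are our assumptions:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def unimodal_visual_summary(frames, n_clusters=5):
    """Cluster frames by color histogram, then pick the sharpest frame
    per cluster by the variance of the Laplacian."""
    hists = [cv2.calcHist([f], [0, 1, 2], None, [8, 8, 8],
                          [0, 256] * 3).flatten() for f in frames]
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.array(hists))
    keyframes = []
    for c in range(n_clusters):
        members = [i for i, l in enumerate(labels) if l == c]
        sharpness = [cv2.Laplacian(cv2.cvtColor(frames[i], cv2.COLOR_BGR2GRAY),
                                   cv2.CV_64F).var() for i in members]
        keyframes.append(frames[members[int(np.argmax(sharpness))]])
    return keyframes
```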
From Tables 1 and 2, we find that the multimodal methods outperform the unimodal approaches, demonstrating the effectiveness of exploring the cross-domain relationship and taking advantage of cross-domain alignment in generating high-quality summaries.

Interpretation
To provide a deeper understanding of the multimodal alignment between the visual and language domains, we compute and visualize the transport plan to interpret the latent representations, as shown in Figure 4. Regarding the extracted embeddings from the text and image spaces as distributions over their corresponding feature spaces, we expect the optimal transport coupling to reveal the underlying similarity and structure. Moreover, the coupling tends to be sparse, which further helps explain the correspondence between the text and image data.
Figure 4 compares matched image-text pairs with non-matched ones. The top two pairs are matched pairs, where the image and the corresponding sentence overlap in meaning. The bottom two pairs are non-matched ones, where the overlap in meaning between the image and text is relatively small. The correlation between the image and language domains can be readily interpreted from the learned transport plan matrix. Specifically, for matched pairs, the optimal transport coupling shows the pattern of sequentially structured knowledge, whereas for non-matched image-sentence pairs, the estimated couplings are relatively dense and barely contain any informative structure. As shown in Figure 4, the transport plan learned in the cross-domain alignment module demonstrates a way to align features from different modalities to represent the key components. This visualization contributes to the interpretability of the proposed model and brings a clearer understanding of the alignment module.

Conclusion
In this work, we proposed SCCS, a segment-level Semantics-Consistent Cross-domain Summarization model for the MSMO task. Our model decomposes the video and article into segments based on their content to preserve the structural semantics, and explores the cross-domain semantics relationship via optimal transport alignment at the segment level. Experimental results on three MSMO datasets show that SCCS outperforms previous summarization methods. We further provide interpretation via the OT coupling. Our approach offers a new direction for the MSMO task and can be extended to many real-world applications.

Limitations
Due to the absence of large evaluation databases, we only evaluated our method on three publicly available datasets usable for the MSMO task. Popular video databases, e.g., the COIN and HowTo100M datasets, cannot be used in our task, since they lack narrations and key-step annotations. A large evaluation database is thus highly needed for assessing the performance of MSMO approaches.
Given the nature of the summarization task, human preference has an inevitable influence on performance, since the ground-truth labels are provided by human annotators. It is difficult to quantitatively specify the quality of a summarization result, and currently widely used evaluation metrics may not reflect the quality of the results very well. We are therefore seeking new directions for quality evaluation.
The current setting is short videos and short documents, due to the constraints of available data. To extend MSMO to a more general setting, e.g., much longer videos or documents, new datasets need to be collected. However, this requires a huge human effort in annotating and organizing a high-value dataset, which is extremely time-consuming and labor-intensive. Nevertheless, we believe the MSMO task is promising and can provide valuable solutions to many real-world problems, and if such a dataset is collected, it could significantly boost research in this field.

Ethics Statement
Our work aims at providing a better user experience when exploring online multimedia, and there is no new dataset collected.To the best of our knowledge, this application does not involve ethical issues, and we do not foresee any harmful uses of this study.

A Optimal Transport (OT) Basis
OT is the problem of transporting mass between two discrete distributions supported on a latent feature space $\mathcal{X}$. Let $\mu = \{x_i, \mu_i\}_{i=1}^{n}$ and $\nu = \{y_j, \nu_j\}_{j=1}^{m}$ be the discrete distributions of interest, where $x_i, y_j \in \mathcal{X}$ denote the spatial locations and $\mu_i, \nu_j$ denote the corresponding non-negative masses. Without loss of generality, we assume $\sum_i \mu_i = \sum_j \nu_j = 1$. A matrix $\pi \in \mathbb{R}_+^{n \times m}$ is a valid transport plan if its row and column marginals match $\mu$ and $\nu$, respectively, i.e., $\sum_i \pi_{ij} = \nu_j$ and $\sum_j \pi_{ij} = \mu_i$. Intuitively, $\pi$ transports $\pi_{ij}$ units of mass from location $x_i$ to location $y_j$. Such transport plans are not unique, and one often seeks the solution $\pi^* \in \Pi(\mu, \nu)$ that is most cost-effective with respect to a cost function $C(x, y)$, where $\Pi(\mu, \nu)$ denotes the set of all viable transport plans:

$D(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \sum_{i,j} \pi_{ij} C(x_i, y_j),$

where $D(\mu, \nu)$ is known as the OT distance. $D(\mu, \nu)$ minimizes the transport cost from $\mu$ to $\nu$ with respect to $C(x, y)$. When $C(x, y)$ defines a distance metric on $\mathcal{X}$, $D(\mu, \nu)$ induces a distance metric on the space of probability distributions supported on $\mathcal{X}$, known as the Wasserstein Distance (WD).
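As a tiny worked example of this definition (using the exact linear-programming solver from the POT library; the toy points and squared-Euclidean cost are our choices):

```python
import numpy as np
import ot

x = np.array([[0.0], [1.0]])      # support of mu
y = np.array([[0.0], [2.0]])      # support of nu
mu = np.array([0.5, 0.5])         # masses mu_i
nu = np.array([0.5, 0.5])         # masses nu_j
C = ot.dist(x, y)                 # C(x_i, y_j) = |x_i - y_j|^2
pi = ot.emd(mu, nu, C)            # optimal plan via linear programming
D = (pi * C).sum()                # OT distance: 0.5*0 + 0.5*1 = 0.5
```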

B More Related Work
Optimal Transport OT studies the geometry of probability spaces (Villani, 2003), providing a formalism for finding and quantifying mass movement from one probability distribution to another. OT defines the Wasserstein metric between probability distributions, revealing a canonical geometric structure with rich properties to be exploited. The earliest contribution to OT originated with Monge in the eighteenth century; Kantorovich later rediscovered it under a different formalism, namely the linear programming formulation of OT. With the development of scalable solvers, OT has been widely applied to many real-world problems and applications (Flamary et al., 2021; Chen et al., 2020a; Yuan et al., 2020; Zhu et al., 2021; Klicpera et al., 2021; Alqahtani et al., 2021; Lee et al., 2019; Chen et al., 2019; Duan et al., 2022).
Video Summarization Video summarization aims at generating a short synopsis that summarizes the video content by selecting its most informative and vital parts. The summary usually contains either a set of representative video keyframes or video key-fragments stitched in chronological order to form a shorter video; the former type is known as a video storyboard, and the latter as a video skim (Apostolidis et al., 2021). Traditional video summarization methods use only visual information, extracting important frames to represent the video content. For instance, Gygli et al. (2014) and Jadon and Jasim (2020) generated video summaries by selecting keyframes using the SumMe and TVSum datasets. Category-driven and supervised training approaches were also proposed to generate video summaries with video-level labels (Song et al., 2015; Zhou et al., 2018a; Xiao et al., 2020; Zhou et al., 2018b).
Textual Summarization Textual summarization takes textual metadata, e.g., documents, articles, or tweets, as input and generates textual summaries, in two directions: abstractive and extractive summarization. Abstractive methods select words based on semantic understanding, and the words may not even appear in the source (Tan et al., 2017; See et al., 2017b). Extractive methods summarize language by selecting a subset of words that retains the most critical points, weighting the essential parts of sentences to form the summary (Narayan et al., 2018; Wu and Hu, 2018). Recently, fine-tuning approaches built on pre-trained language models have improved the quality of generated summaries on a wide range of tasks (Liu and Lapata, 2019; Zhang et al., 2019c).
Video Temporal Segmentation Video temporal segmentation aims at generating small video segments based on the content or topics of the video, a fundamental step that plays a crucial role in content-based video analysis.
Textual Segmentation Textual segmentation aims at dividing text into coherent, contiguous, and semantically meaningful segments (Nicholls, 2021). These segments can be composed of words, sentences, or topics, and the text types include blogs, articles, news, video transcripts, etc. Previous work focused on heuristics-based methods (Koshorek et al., 2018; Choi, 2000), LDA-based modeling algorithms (Blei et al., 2003; Chen et al., 2009), and Bayesian methods (Chen et al., 2009; Riedl and Biemann, 2012). Recent developments in NLP train large models on huge amounts of data in a supervised manner (Mikolov et al., 2013; Pennington et al., 2014; Li et al., 2018; Wang et al., 2018). Besides, unsupervised or weakly-supervised methods have also drawn much attention (Glavas et al., 2016; Lukasik et al., 2020).

C Baselines

Figure 2: A real example of the summarization process given by our SCCS method. Here we conduct OT-based cross-domain alignment on each keyframe-sentence pair, and a smaller OT distance means better alignment. (For example, the best-aligned text and image summary (0.08) delivers the flooding content clearly and comprehensively.)

Figure 3: (a) The computational framework of the SCCS model, which takes multimodal inputs (videos & text documents) and generates multimodal summaries. The framework includes five modules: video temporal segmentation, visual summarization, textual segmentation, textual summarization, and multimodal alignment. (b) The structure of the video segmentation encoder. (c) The architecture of the textual segmentation module. (d) The multimodal alignment module for multimodal summaries.

Figure 4: The OT coupling shows sparse patterns and a specific temporal structure for the embedding vectors of ground-truth-matched video and text segments.

Table 1: Comparison with multimodal baselines on the VMSMO dataset. The absolute performance comparison with the baseline MSMO method is marked in red (better) and blue (worse).

Table 3: Human evaluation results.

Table 2: Comparisons of multimodal baselines on the Daily Mail and CNN datasets. The absolute performance comparison with the baseline MSMO method is marked in red (better) and blue (worse).
HNNattTI: aligned the sentences and accompanying images by using an attention mechanism.
M2SM (Fu et al., 2021, 2020): a multimodal summarization model with a bi-stream summarization strategy for training, sharing the ability to refine significant information from long materials in text and video summarization.
Video summarization baselines:
VSUMM (De Avila et al., 2011): a methodology for producing static video summaries, which extracts color features from video frames and adopts k-means for clustering.
DR-DSN (Zhou et al., 2018a): Zhou et al. (2018a) formulated video summarization as a sequential decision-making process and developed a deep summarization network (DSN) to summarize videos. DSN predicts a probability for each frame, indicating the likelihood of the frame being selected, and then takes actions based on the probability distribution to select frames to form video summaries.
Textual summarization baselines:
Lead3 (See et al., 2017a): similar to Lead, Lead3 picks the first three sentences as the summary result.
NN-SE (Cheng and Lapata, 2016): a general framework for single-document summarization composed of a hierarchical document encoder and an attention-based extractor.
T5 (Raffel et al., 2019): an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, in which each task is converted into a text-to-text format. T5 works well on a variety of tasks out of the box by prepending a different prefix to the input for each task, including summarization.
Pegasus (Zhang et al., 2019a): Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models (PEGASUS) uses the self-supervised Gap Sentences Generation (GSG) objective to train a Transformer encoder-decoder model.
BART (Lewis et al., 2020): a sequence-to-sequence model trained as a denoising autoencoder, which showed great performance on a variety of text summarization datasets.
C.1 Baselines for the VMSMO dataset