Video Paragraph Captioning as a Text Summarization Task

Video paragraph captioning aims to generate a set of coherent sentences describing a video that contains several events. Most previous methods simplify this task by using ground-truth event segments. In this work, we propose a novel framework that treats the task as text summarization. We first generate many sentence-level captions focusing on different video clips and then summarize these captions to obtain the final paragraph caption. Our method does not depend on ground-truth event segments. Experiments on two popular datasets, ActivityNet Captions and YouCookII, demonstrate the advantages of our new framework. On the ActivityNet Captions dataset, our method even outperforms some previous methods that use ground-truth event segment labels.


Introduction
Video captioning, the task of describing the content of a video in natural language, is a popular task in both computer vision and natural language processing. Early work generated sentence-level captions for short video clips (Venugopalan et al., 2015). Krishna et al. (2017) propose the task of dense video captioning, in which the system first detects event segments and then generates a caption for each. Park et al. (2019) propose the task of video paragraph captioning: they use ground-truth event segments and focus on generating coherent paragraphs. Lei et al. (2020) follow this task setting and propose a recurrent transformer model that generates more coherent and less repetitive paragraphs. Since ground-truth event segments are often unavailable in practice, our goal is to generate paragraph captions without them.
The conventional framework of video paragraph captioning is shown in Figure 1a. Given an untrimmed video, an Event Detection module outputs a set of non-redundant event segments, and an Event Captioning module generates captions for these segments. Park et al. (2019) and Lei et al. (2020) use ground-truth event segments and focus on the Event Captioning module. Other work uses extra human-annotated bounding boxes as supervision. Sah et al. (2017), Zhou et al. (2018) and Mun et al. (2019) use predicted event segments and generate captions based on them. Sah et al. (2017) also summarize these captions to generate a paragraph. All of these methods depend heavily on accurate event segments. According to previous work (Zhou et al., 2018; Mun et al., 2019), the Event Detection module performs poorly, which makes it a performance bottleneck. To tackle this problem, we propose a novel framework, VPCSum, shown in Figure 1b. For a given video, we first extract dense event segment candidates (which we call proposals), and a Proposal Captioning module generates a caption for each proposal. We then treat video paragraph captioning as a text summarization task over these proposal captions to obtain the final summary (the paragraph caption).
In this work, we only consider extractive summarization, where the paragraph caption is composed by selecting from the proposal captions. We conduct experiments on two popular datasets, ActivityNet Captions and YouCookII. The results demonstrate the advantages of our framework. On the ActivityNet Captions dataset, our method even outperforms some previous methods that use ground-truth event segment labels.

Our VPCSum Method
As illustrated in Figure 1b, our framework has three modules: Proposal Extraction, which extracts dense proposals for a video; Proposal Captioning, which generates captions for the extracted proposals; and Caption Summarization, which summarizes the generated proposal captions to obtain the video paragraph caption. We introduce each module next.

Proposal Extraction
For proposal extraction, we use the BMN model (Lin et al., 2019), a popular model for temporal action proposal generation. It can extract complete and accurate proposals. We extract the top 100 proposals for each video.

Proposal Captioning
For proposal captioning, we choose the TSRM-RNN model for ActivityNet Captions and the VTransformer model (Lei et al., 2020) for YouCookII, according to their proposal captioning performance. We believe that a better sentence-level captioning model would further improve performance.

Caption Summarization
The caption summarization module summarizes the proposal captions to generate the final video paragraph caption. In this work, we focus on extractive summarization. The architecture of our summarization model is illustrated in Figure 2. We first sort the proposal captions by proposal start time and add the special [CLS] and [SEP] tokens to the beginning and end of each caption. Each word is represented by the sum of its token, segment, and position embeddings. The input representations are fed into a pre-trained BERT model (Devlin et al., 2018), from which we obtain contextual token representations. We use the contextual vector of each [CLS] token to represent the corresponding caption and feed these vectors into stacked transformer layers (Vaswani et al., 2017). A sigmoid layer then computes the score of each caption:

$$s_{txt}(i) = \sigma(W h_i^L + b) \quad (1)$$

where $W$ and $b$ are trainable parameters and $h_i^L$ is the vector for caption $i$ from the top transformer layer.
Extractive summarization requires annotating each sentence against the gold summary to obtain a training target. Many researchers use a greedy algorithm (Nallapati et al., 2016) in which sentences are selected one by one to maximize the ROUGE score against the gold summary. The selected sentences are labeled 1 and the others 0 (hard-label). For our task, we find a soft-label annotation method to be more effective. We label caption $c_i$ with its maximum ROUGE score against the gold captions,

$$y_i = \max_j \mathrm{ROUGE}(c_i, g_j) \quad (2)$$

and use binary cross-entropy as our loss function:

$$\mathcal{L} = -\sum_i \left[ y_i \log s_{txt}(i) + (1 - y_i) \log\left(1 - s_{txt}(i)\right) \right] \quad (3)$$

where $g_j$ is the $j$-th gold caption.
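The soft-label annotation and loss described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the source does not say which ROUGE variant is used, so a whitespace-tokenized unigram F1 (`rouge1_f1`, a hypothetical helper) stands in for it, and the loss is summed over captions, a reduction the source does not specify.

```python
import math
from collections import Counter


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1: a simple stand-in for ROUGE (assumption)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Clipped unigram matches between candidate and reference.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def soft_labels(proposal_captions, gold_captions):
    """Label each proposal caption with its best score against any gold caption."""
    return [max(rouge1_f1(c, g) for g in gold_captions)
            for c in proposal_captions]


def binary_cross_entropy(predicted, targets, eps=1e-7):
    """Binary cross-entropy with soft targets in [0, 1], summed over captions."""
    total = 0.0
    for p, y in zip(predicted, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total


gold = ["she does a flip", "a girl jumps"]
captions = ["she does a flip", "a dog runs"]
labels = soft_labels(captions, gold)
loss = binary_cross_entropy([0.8, 0.3], labels)
```

Unlike hard labels, a caption that only partly matches a gold sentence still receives a non-zero target, so many near-duplicate proposal captions are not all pushed towards 0.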

Leverage Visual Information
The caption summarization module above assigns each proposal caption a predicted score indicating how likely the caption is to appear in the final paragraph caption. This score depends only on text information. To leverage visual information, we need a "visual summarization" module that gives each proposal a visual weighting score. The ESGN model (Mun et al., 2019) is a good fit: it uses a pointer network to select events from proposals and assigns a visual weighting score to each proposal. We use this model to compute the visual weighting score. The final score of each proposal caption is a weighted sum of the textual weighting score $s_{txt}$ and the visual weighting score $s_{vis}$:

$$\mathrm{score}(i) = s_{txt}(i) + \lambda \, s_{vis}(i) \quad (4)$$

where $\lambda$ is a hyper-parameter tuned on the validation set. We select captions according to $\mathrm{score}(i)$ and use Trigram Blocking to reduce redundancy, as in Liu and Lapata (2019).
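The selection step can be sketched as follows. This is a hedged illustration, not the authors' code. It assumes the weighted sum takes the form `s_txt + lam * s_vis`, which matches the description that setting λ to 0 removes the visual score. It also assumes whitespace tokenization, a fixed cap `max_sentences` on the paragraph length (the source does not state a stopping rule), and that the selected captions are output in order of proposal start time. The function names are hypothetical.

```python
def trigrams(sentence):
    """Set of word trigrams in a sentence (whitespace tokenization)."""
    tokens = sentence.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def select_captions(captions, s_txt, s_vis, lam=1.0, max_sentences=3):
    """Pick captions by combined score, skipping trigram-overlapping ones.

    captions: list of (start_time, text) pairs, one per proposal.
    s_txt, s_vis: textual and visual weighting scores, one per proposal.
    """
    scores = [t + lam * v for t, v in zip(s_txt, s_vis)]
    ranked = sorted(range(len(captions)), key=lambda i: scores[i], reverse=True)

    chosen, seen = [], set()
    for i in ranked:
        if len(chosen) >= max_sentences:
            break
        tri = trigrams(captions[i][1])
        if tri & seen:
            # Trigram Blocking: skip a caption that repeats any trigram
            # already present in the selected captions.
            continue
        chosen.append(i)
        seen |= tri

    # Restore temporal order so the paragraph follows the video.
    chosen.sort(key=lambda i: captions[i][0])
    return [captions[i][1] for i in chosen]


proposals = [
    (0.0, "a girl jumps onto a beam"),
    (5.0, "a girl jumps onto the mat"),
    (9.0, "she lands on a mat"),
]
paragraph = select_captions(proposals, [0.9, 0.8, 0.5], [0.0, 0.0, 1.0])
```

In this toy input the second caption shares the trigram "a girl jumps" with the first and is blocked, even though it scores higher than nothing else remaining would require.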

Datasets
We conduct experiments on ActivityNet Captions (Krishna et al., 2017) and YouCookII (Zhou et al., 2017). ActivityNet Captions contains 10,009 videos in the training set and 4,917 videos in the validation set. Each video has 3.65 event segments on average. Following Lei et al. (2020), the original validation set is split into ae-val, with 2,460 videos for validation, and ae-test, with 2,457 videos for testing. YouCookII contains 1,333 videos in the training set and 457 videos in the validation set. Each video has 7.70 event segments on average.

Implementation Details
For video preprocessing, we use the appearance and optical flow features provided by Zhou et al. (2018). For the BMN model and the captioning models, we use the hyperparameters suggested by their authors.
For the ESGN model, we use a transformer encoder instead of an RNN encoder, with a hidden size of 512, 8 heads, and 3 layers. For our caption summarization model, we use the base BERT model and 2 stacked transformer layers with a hidden size of 768 and 8 heads. We set the maximum input length to 1,700, the batch size to 10, and λ to 1 for ActivityNet Captions, and the maximum input length to 1,000, the batch size to 1, and λ to 1 for YouCookII. The number of warmup steps is set to the number of steps in one epoch. We use the Adam optimizer with an initial learning rate of 6e-4.

Baselines and Results
We compare our VPCSum model with the following baselines. Soft-NMS: it uses Soft-NMS (Bodla et al., 2017) to select event segments from BMN proposals, and uses the proposal captioning model to generate captions; ESGN: similar to Soft-NMS, but it uses the ESGN model (Mun et al., 2019) to select event segments from BMN proposals; V-Trans: a vanilla Transformer model (Zhou et al., 2018); Trans-XL: a Transformer-XL model (Lei et al., 2020); MART: a recurrent transformer model (Lei et al., 2020); COOT: it uses pretrained features to train the MART model (Ging et al., 2020). The last four models were originally designed for ground-truth event segments. For a fair comparison, we also test them with predicted event segments generated by the ESGN model.
Tables 1 and 2 show the results on ActivityNet Captions and YouCookII. On ActivityNet Captions, our VPCSum model generates better paragraph captions, with higher Bleu@4, METEOR, and CIDEr and a lower repetition score R@4, even outperforming the V-Trans*, Trans-XL*, and MART* models, which use ground-truth event segments, on every metric. On the YouCookII dataset, our model outperforms the models in the same setting but is inferior to the models using ground-truth segments. This may be because YouCookII has more segments per video (7.70 vs 3.65) than ActivityNet Captions.
Table 3 shows the ablation study on ActivityNet Captions. Compared to our full model (Full), the traditional extractive summarization annotation method (Hard-label) is not suitable for our task. If we set λ in Eq. (4) to 0 (w/o vis), the model loses useful visual information and its performance drops. If we remove Trigram Blocking (w/o tri-blk), the performance also degrades and repetition becomes a problem (R@4 increases to 7.91). To verify the role of the pretrained BERT model, we retrain our VPCSum model without BERT pretrained weights (w/o pretrain). We can see that the BERT pretrained weights are not the major factor in the final performance. We also replace our summarization model with the unsupervised methods LexRank (Erkan and Radev, 2004) and LSA (Steinberger and Jezek, 2004).
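For readers unfamiliar with the Soft-NMS baseline listed above, the following is a minimal sketch of the Gaussian variant of Soft-NMS (Bodla et al., 2017) adapted to temporal segments. It is an illustration under stated assumptions, not the configuration used in the experiments: `sigma` and `top_k` are hypothetical values, and proposals are assumed to be `(start, end, score)` tuples.

```python
import math


def temporal_iou(a, b):
    """Intersection over union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(proposals, sigma=0.5, top_k=3):
    """Gaussian Soft-NMS over temporal proposals.

    Instead of discarding proposals that overlap a selected one, their
    scores are decayed by exp(-IoU^2 / sigma), so heavily overlapping
    proposals are unlikely to be selected later.
    """
    remaining = [list(p) for p in proposals]
    kept = []
    while remaining and len(kept) < top_k:
        # Take the proposal with the highest current score.
        best = max(range(len(remaining)), key=lambda i: remaining[i][2])
        start, end, score = remaining.pop(best)
        kept.append((start, end, score))
        # Decay the scores of the remaining proposals by their overlap.
        for p in remaining:
            iou = temporal_iou((start, end), (p[0], p[1]))
            p[2] *= math.exp(-(iou ** 2) / sigma)
    return kept


candidates = [(0, 10, 0.9), (1, 10, 0.85), (20, 30, 0.6)]
segments = soft_nms(candidates, top_k=2)
```

Here the second candidate overlaps the first almost entirely, so its score is decayed below that of the third, non-overlapping candidate.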

Qualitative Results
We show an example in Figure 3 with paragraph captions generated by MART, MART* and our VPCSum model. Compared to the other models, our model generates clearer and more accurate sentences with less redundancy, and its paragraph better describes the process of the whole event.

Figure 3: Paragraph captions generated by MART, MART* and our VPCSum model.
Ground Truth: A girl jumps onto a balance beam. She does a gymnastics routine on the balance beam. She does a flip off the balance beam and lands on a mat.
MART: A gymnast is seen standing ready with her arms up and leads into her performing a gymnastics routine. She continues performing several flips and tricks and tricks and ends with her jumping down and walking away. She continues her routine and ends with her jumping down and jumping down and walking away.
MART*: A gymnast is seen standing before a beam and begins performing a gymnastics routine. The girl then performs a routine on the beam and ends with her jumping down and jumping down and jumping. The girl jumps off the beam and lands on the mat and jumps off the beam.
VPCSum: A gymnast is seen standing ready with her arms up and begins to do a routine. She does a gymnastics routine on the beam. She dismounts and lands on the mat.

We also conduct a human evaluation on 50 randomly sampled videos from the ActivityNet Captions val set. The annotators are asked to choose the better caption from two models in two aspects: relevance (how related the caption is to the video content) and diversity (how diverse the generated text is). We compare our VPCSum model with MART and MART* respectively. Our annotators are 17 college students, and each video is judged by 3 annotators. Table 4 shows the results of the pairwise experiments. Our VPCSum model performs better in both relevance and diversity, and more annotators choose the caption of our model as the better one.

Conclusion
In this work, we view video paragraph captioning as a text summarization task and propose a novel framework, VPCSum, which allows us to apply text summarization techniques to this challenging task. Experimental results on two popular datasets show the advantages of our model. In the future, we will explore abstractive summarization methods to generate better video paragraph captions.