Hierarchical Context-aware Network for Dense Video Event Captioning

Dense video event captioning aims to generate a sequence of descriptive captions, one for each event in a long untrimmed video. Video-level context provides important information and helps the model generate consistent and less redundant captions across events. In this paper, we introduce a novel Hierarchical Context-aware Network (HCN) for dense video event captioning that captures context at several levels. In detail, the model leverages local and global context with different mechanisms and learns them jointly to generate coherent captions. The local context module performs full interaction between neighboring frames, and the global context module selectively attends to previous or future events. In extensive experiments on both the YouCook2 and ActivityNet Captions datasets, the video-level HCN model outperforms the event-level context-agnostic model by a large margin. The code is available at https://github.com/KirkGuo/HCN.


Introduction
With the growing volume of video uploaded online every day, acquiring knowledge from videos, especially for how-to tasks, has become indispensable to people's daily life and work. However, watching a whole long video is time-consuming. Existing technologies follow two main research directions to condense video information: video summarization, which trims long videos into short ones, and (dense) video captioning, which generates textual descriptions of the key events in the video. For long untrimmed videos in particular, dense video event captioning generates fine-grained captions for all events, which helps users skim the video content quickly and enables applications such as video chaptering and search inside a video.
* Equal contribution
Figure 1: A showcase of dense video event captioning. Given a video and the speech text, the task is to generate event proposals and captions.
Dense video event captioning (Krishna et al., 2017) and multi-modal video event captioning (Iashin and Rahtu, 2020b) aim to generate a sequence of captions for all events given uni-modal (video) or multi-modal (video + speech) inputs. Figure 1 presents a showcase that illustrates the challenges of this task from both the vision and the speech text perspective. For vision understanding, fine-grained objects are hard to recognize due to ambiguity, occlusion, or state change. In this case, the object "dough" is occluded in event 1 and is hard to recognize from the video. However, it can be recognized from the previous neighboring video frame, where it appears clearly. From the speech text perspective, although the speech text offers semantic concepts (Shi et al., 2019; Iashin and Rahtu, 2020b), it brings the further challenge of co-reference and ellipsis caused by the informal nature of spoken language. In the case of Figure 1, the entity "dough" is elided from the text of event 3. Nonetheless, the model can still generate the consistent object "dough" in event 3 by using contextual information from other events, such as event 1 in this example. To sum up, both local neighbor-clip and global inter-event contexts are important for event-level captioning to generate coherent and less redundant descriptions across events.
Previous work widely used recurrent neural networks (Krishna et al., 2017), which struggle to capture long-range dependencies, while attention-based models (Zhou et al., 2018b; Sun et al., 2019b,a) have recently become the new paradigm for dense video event captioning and have proven effective for multi-modal video captioning (Shi et al., 2019; Iashin and Rahtu, 2020b). However, existing attention-based models generate captions relying only on the video clip inside each event and ignore video-level local and global context. Motivated by this, we investigate how to leverage both local and global context effectively and jointly for video captioning.
In this paper, we propose a novel hierarchical context-aware model for dense video event captioning (HCN) to capture both the local and global context simultaneously. In detail, we first use a local context encoder to embed the visual and linguistic features of the source and surrounding clips, then design a global context encoder to capture relevant features from other events. Specifically, we apply different mechanisms: a flat attention module between the source and the local context, and a cross attention module that lets the source select from the global context. Because neighboring (temporally close) frames are usually alike, e.g. containing the same objects, the flat attention performs a full interaction between them to generate accurate and coherent captions. Meanwhile, the cross attention over the global context selectively attends to the relevant events and captures prior temporal dependencies between events to generate coherent and less redundant captions. The experimental results demonstrate the effectiveness of our model. Our contributions can be summarized as follows:
1) We propose a hierarchical context-aware model for dense video event captioning to capture video-level context.
2) We carefully design different mechanisms to capture both local and global context: a flat attention model with full interaction between neighboring frames and a cross attention model to selectively capture inter-event features.
3) Experimental results on both the YouCook2 and ActivityNet Captions datasets demonstrate the effectiveness of our model, which outperforms the context-agnostic model by a large margin.

Preliminary
The dense video event captioning task is to produce a sequence of events and generate a descriptive sentence for each event given a long untrimmed video. In this work, we focus only on the task to generate captions and directly apply the ground-truth event proposals similar to (Hessel et al., 2019;Iashin and Rahtu, 2020b). The paradigm for video captioning is an encoder-decoder network, which inputs video features and outputs descriptions for each event.
In this section, we describe the task formulation including the context-agnostic model as well as the context-aware model in one framework.

Overview
Problem Definition. We define a sequence of event segment proposals as $e = \{e_i \mid i \in [1, m]\}$, representing a video with $m$ proposals. Here $e_i = \{v_i, t_i\}$ is the feature of the $i$-th event, where $v_i$ is the video feature and $t_i$ is the transcript text feature (if available). We take all the video frames and transcript tokens of the event between its start and end times. The number of video frames generally differs from the number of text tokens, depending on the actual video clip. Given all events $e$, the goal is to predict the target descriptive sentences $Y = \{y_i \mid i \in [1, m]\}$, where each $y_i$ is a sequence of descriptive words for the event $e_i$. The context-agnostic model approximates the probability of the expected sentences $Y$ as
$$P(Y \mid e) \approx \prod_{i=1}^{m} P(y_i \mid e_i),$$
which is to predict $y_i$ conditioned on the event $e_i$ alone. The context-aware model additionally considers the local context $v_{\neq i}$ (the neighboring video clip) and the global context $e_{\neq i}$ (the clips of past and future events). The context-aware probability can be approximated as
$$P(Y \mid e) \approx \prod_{i=1}^{m} P(y_i \mid e_i, v_{\neq i}, e_{\neq i}).$$

Methodology

Context-agnostic model
The context-agnostic captioning model generates a descriptive sentence given the short trimmed video clip of each event. The paradigm for multi-modal video captioning is an encoder-decoder network as in (Hessel et al., 2019). First, we pre-process each event and extract features separately. For the event $e_i$, we extract the video feature $v_i$ and, if available, the transcript feature $t_i$. Next, the video features and transcript features are concatenated as the input to the transformer encoder. This encoder implements self-attention within each modality and cross attention between the two modalities in one unified transformer. Finally, a transformer decoder generates the text tokens of the description from the enhanced features.

Context-aware model
We propose a context-aware video event captioning model with a hierarchical context-aware network (HCN). The architecture is a general framework for either uni-modal or multi-modal inputs, as illustrated in Figure 2. It comprises a local context module (LCM), a global context module (GCM), a context gate, and a decoder. LCM enhances the visual feature with local context and, given multi-modal inputs, optionally fuses the visual and text features. GCM employs a cross attention model to encode the source visual feature with other event features: it uses the SEncoder to encode the source and the context separately and the CEncoder to selectively attend to the context.

Multi-modal Feature Representation
For visual features, we adopt a pre-trained 3D feature extractor to extract $k$ features $v_i = \{v_j \mid j \in [1, k]\}$ for the $i$-th event. We further add a projection layer that maps the raw features to the input dimension, giving the embedded visual feature $f(v_i)$. For transcript text, we tokenize the text into words and represent each word with a one-hot vector. The tokens within each event are represented as $t_i = \{t_j \mid j \in [1, l]\}$, where $l$ is the number of transcript tokens in the speech of the event. We then embed each token into a continuous representation $f(t_i)$ with an embedding layer. Similar to (Hessel et al., 2019), we build the vocabulary from all tokens in the captioning sentences. The input for each event comprises three types of embedding: 1) the visual feature $f(v_i)$ (and the speech text feature $f(t_i)$ if available); 2) the position embeddings $p(v_i)$ and $p(t_i)$ as introduced in the transformer model (Vaswani et al., 2017); 3) the type embeddings $s(v_i)$ and $s(t_i)$, which indicate whether the current embedding comes from the context or the source:
$$E(v_i) = f(v_i) + p(v_i) + s(v_i), \qquad E(t_i) = f(t_i) + p(t_i) + s(t_i),$$
where $+$ is element-wise addition, and $E(v_i)$ and $E(t_i)$ are the embeddings of video and text respectively. For multi-modal input, the visual and text features are concatenated for further processing.
We extract two types of contextual information: event-agnostic local context and event-aware global context. The event-agnostic context takes frames temporally close to the video event. Video is a continuous signal, and neighboring video frames are likely to be semantically related to each other, e.g. they show the same objects. This is especially helpful for recognizing objects that change state or are occluded in the current event. Moreover, objects are likely to be mentioned explicitly in the contextual transcript, which helps resolve object co-reference and ellipsis, particularly in instructional videos. The event-aware context uses the video frames of both previous and future events to model the relation between events. The global context provides overall features and prior knowledge of temporal dependency. In a particular domain such as recipes, for instance, the event "mix the flour and water" is often followed by "knead the dough". This prior knowledge of event dependency, learned from the global context, is effective for understanding long videos.
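The three-part input embedding described in this section (feature, position, and type embeddings added element-wise) can be illustrated with a minimal Python sketch. It assumes the fixed sinusoidal position embedding of Vaswani et al. (2017); the names `embed_sequence`, `type_source`, and `type_context` and all numeric values are hypothetical, and in a trained model the projection and the type vectors would be learned rather than fixed.

```python
import math


def sinusoidal_position(pos, dim):
    """Sinusoidal position embedding of Vaswani et al. (2017) for one position."""
    emb = []
    for idx in range(dim):
        # Dimensions 2i and 2i+1 share the same frequency.
        exponent = (idx - idx % 2) / dim
        angle = pos / (10000 ** exponent)
        emb.append(math.sin(angle) if idx % 2 == 0 else math.cos(angle))
    return emb


def embed_sequence(features, type_vector):
    """Return E = f + p + s for every position of one sequence."""
    dim = len(features[0])
    embedded = []
    for pos, feat in enumerate(features):
        p = sinusoidal_position(pos, dim)
        embedded.append([f + pe + s for f, pe, s in zip(feat, p, type_vector)])
    return embedded


# Hypothetical values: already-projected frame features f(v_i) of dimension 4,
# and fixed type vectors standing in for the learned type embeddings s(.).
projected_frames = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
type_source = [0.0, 0.0, 0.0, 0.0]
type_context = [1.0, 1.0, 1.0, 1.0]

source_embedding = embed_sequence(projected_frames, type_source)
context_embedding = embed_sequence(projected_frames, type_context)
```

The same frames receive different embeddings depending on whether they are tagged as source or context, which is what lets the downstream attention layers tell the two roles apart.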

Hierarchical Context-aware Network
The overall pipeline includes four modules. 1) The hierarchical model starts with a local context module (LCM) that encodes the local context features, i.e. the neighboring video clip temporally close to the event. Specifically, the LCM adopts a flat attention model similar to (Ma et al., 2020) to enhance the source video feature with local context. In addition, given multi-modal inputs, the LCM fuses the visual features $f(v_i)$ and the text features $f(t_i)$ inside the event with one unified transformer, as in (Hessel et al., 2019). 2) We further employ a global context module (GCM) that lets the source event interact flexibly with the features of other events. The GCM is a cross attention model that contains one source encoder (SEncoder) and one cross encoder (CEncoder). SEncoder is a self-attention module for encoding event features, and CEncoder is a cross attention module for interaction between the source and context events. 3) The hierarchical context-aware model then combines the neighbor-clip context (around the event) and the inter-event context (other events, both previous and future) using a gating mechanism. 4) Finally, an auto-regressive decoder generates the sentence with a masked transformer model.

Local Context Module
We first introduce the local context module to encode multi-modal source video features together with the event-agnostic context features (surrounding frames). The flat transformer in (Ma et al., 2020) is effective for encoding contextual information with full interaction between source and context features. In addition, when the speech text is available for multi-modal video captioning, this flat encoder can also perform the fusion of visual and text modalities, which is similar to (Hessel et al., 2019). To sum up, we employ one unified flat encoder to accomplish two actions: source-context interaction and multi-modal fusion as explained in Figure 3a.
where $[\,;\,]$ denotes concatenation, FFN is the feed-forward network, and MultiHead is the multi-head attention network of the transformer (Vaswani et al., 2017). We apply residual connections to all components. We perform equation 5 only for multi-modal video event captioning, where $E(e_i)$ is the concatenation of the visual embedding and the text embedding of event $i$. We then feed the embedding $E(e_i)$, together with the embeddings of the neighboring frames $E(v_{i \pm k_l})$, into the transformer blocks and obtain the context-aware encoding $H(m_i)$, where $k_l$ is the local context length. Finally, we keep only the outputs at the source positions for further processing instead of using all outputs; intuitively, the source is more important than the context. In equation 7, $H(e^l_i)$ is the hidden state of the source input, which makes the model focus on the current source event; $i_1$ is the start of event $i$ and $i_n$ is its end. LCM thus outputs an event representation enhanced by local context and multi-modal inputs.
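The flat-attention idea of the LCM (attend over the source frames and their $k_l$ neighbors jointly, then keep only the source positions) can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses a single attention head with identity projections, omits the feed-forward layer and layer normalization, and the function names and toy frame values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Single-head scaled dot-product self-attention with identity projections."""
    dim = len(seq[0])
    out = []
    for query in seq:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in seq]
        weights = softmax(scores)
        out.append([sum(w * value[d] for w, value in zip(weights, seq)) for d in range(dim)])
    return out


def local_context_encode(frames, start, end, k_l):
    """Encode frames[start:end] together with up to k_l neighbors on each side.

    Source and context frames attend to each other in one flat sequence;
    only the encodings at the source positions are returned.
    """
    lo = max(0, start - k_l)
    hi = min(len(frames), end + k_l)
    window = frames[lo:hi]
    attended = self_attention(window)
    # Residual connection around the attention layer.
    encoded = [[x + a for x, a in zip(xs, att)] for xs, att in zip(window, attended)]
    return encoded[start - lo:end - lo]


# Toy per-frame embeddings for a six-frame video; the event covers frames 2..3.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [2.0, 0.0], [0.0, 2.0]]
with_context = local_context_encode(frames, start=2, end=4, k_l=1)
without_context = local_context_encode(frames, start=2, end=4, k_l=0)
```

The output always has one vector per source frame, but its values change once neighbors are included, which is the sense in which the source encoding is enhanced by local context.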

Global Context Module
We then describe the global context module, which encodes the output of LCM together with the event-aware context (previous or future events). GCM is a cross attention model that selectively attends to previous or future events to enhance the source video representation. Unlike LCM, which applies a unified transformer to encode a short context, GCM uses a cross attention model similar to (Maruf et al., 2019) to encode the long global context efficiently. A unified transformer handles long input sequences poorly because the cost of self-attention grows quadratically with sequence length. The cross attention model lets the source interact with each context event separately and scales easily to long videos. Figure 3b illustrates the GCM model structure.
We exploit the GCM for each contextual event and then combine all the encoding through a context gating mechanism similar to (Maruf et al., 2019). First, the self-attention module encodes each source or context event separately. Then, the cross attention module empowers the source to attend to context.
where $H(\hat{e}_i)$ is the encoding of source event $i$, $H(e_j)$ is the encoding of the $j$-th context event, and $H(e^c_j)$ is the source attended to the $j$-th event. Next, we adopt a gated recurrent unit (GRU) (Cho et al., 2014) to selectively update the source feature with the context-enhanced feature, which our ablation study shows to be effective. In the GRU update, $\sigma$ is the logistic sigmoid, $\phi$ is the tanh activation, $w$ and $u$ are learnable weight matrices, and $h_j$ is the encoded representation after the source event $i$ has attended to the context event $j$.
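The two steps above, cross attention from the source to one context event followed by a GRU update, can be sketched in plain Python. The GRU cell follows the standard formulation of Cho et al. (2014); everything else is an assumption made for illustration: attention uses a single head with identity projections, the GRU treats the running source feature as its state and the context-attended feature as its input, and the parameter names (`w_z`, `u_z`, and so on) and toy values are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(source, context):
    """Each source position queries all positions of one context event."""
    dim = len(source[0])
    out = []
    for query in source:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in context]
        weights = softmax(scores)
        out.append([sum(w * value[d] for w, value in zip(weights, context)) for d in range(dim)])
    return out


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_update(state, inputs, p):
    """Standard GRU cell: new = z * state + (1 - z) * candidate."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["w_z"], inputs), matvec(p["u_z"], state))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["w_r"], inputs), matvec(p["u_r"], state))]
    reset_state = [ri * si for ri, si in zip(r, state)]
    candidate = [math.tanh(a + b)
                 for a, b in zip(matvec(p["w_h"], inputs), matvec(p["u_h"], reset_state))]
    return [zi * si + (1.0 - zi) * ci for zi, si, ci in zip(z, state, candidate)]


def attend_and_update(source, context_events, p):
    """Fold context events into the source encoding one event at a time."""
    state = source
    for context in context_events:
        attended = cross_attention(source, context)
        state = [gru_update(s, a, p) for s, a in zip(state, attended)]
    return state


# Toy encodings: a two-position source event, one previous and one next event.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {"w_z": identity, "u_z": zeros, "w_r": identity,
          "u_r": zeros, "w_h": identity, "u_h": identity}
source = [[1.0, 0.0], [0.0, 1.0]]
previous_event = [[0.5, 0.5]]
next_event = [[1.0, 1.0], [0.0, 2.0]]
updated = attend_and_update(source, [previous_event, next_event], params)
```

Because each context event is attended to separately, the attention cost grows linearly with the number of context events rather than quadratically with the total sequence length.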

Context Gating
We adopt the gate in (Tu et al., 2018) to regulate the source $H(e^l_i)$ and the context information $h_j$, and obtain the context-enhanced source embedding for further decoding.
Here, $h_c$ is the integration of all previous context $h_p$ and future context $h_f$; $w_j$, $w_k$, $w_c$, and $w_s$ are learnable parameter matrices, and $H$ is the final representation.
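One plausible form of this gate, written with the symbols defined above, is sketched below. The linear merge of $h_p$ and $h_f$, the assignment of $w_s$ and $w_c$ to the source and context terms, and the convex-combination output are assumptions modeled on common context-gate designs; they are not a transcription of the authors' equations.

```latex
% Assumed merge of previous and future context
h_c = w_j h_p + w_k h_f
% Assumed gate computed from the source encoding and the merged context
\lambda = \sigma\left(w_s H(e^l_i) + w_c h_c\right)
% Assumed element-wise interpolation between source and context
H = \lambda \odot H(e^l_i) + (1 - \lambda) \odot h_c
```

Under this form, $\lambda$ close to 1 keeps the source encoding almost unchanged, while $\lambda$ close to 0 lets the context dominate.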

Decoding and Loss
The decoder is an auto-regressive transformer model that generates tokens one by one. We train with the cross-entropy loss, i.e. we minimize the negative log-likelihood of the ground-truth words, and apply label smoothing.
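As an illustration of this loss, the sketch below computes label-smoothed cross-entropy for a short caption in plain Python. It assumes the common variant in which the target distribution is $(1 - \epsilon)$ times the one-hot label plus $\epsilon / K$ spread uniformly over all $K$ vocabulary entries; the text does not state which variant the authors use. The function names and logits are hypothetical, and the default rate of 0.4 is the smoothing rate reported in the implementation details.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary distribution."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def smoothed_token_loss(logits, target, smoothing):
    """Cross-entropy against (1 - smoothing) * one_hot + smoothing / K."""
    log_probs = log_softmax(logits)
    vocab_size = len(logits)
    loss = 0.0
    for idx, log_prob in enumerate(log_probs):
        weight = smoothing / vocab_size
        if idx == target:
            weight += 1.0 - smoothing
        loss -= weight * log_prob
    return loss


def caption_loss(step_logits, target_ids, smoothing=0.4):
    """Average the per-token loss over all decoding steps of one caption."""
    losses = [smoothed_token_loss(logits, target, smoothing)
              for logits, target in zip(step_logits, target_ids)]
    return sum(losses) / len(losses)


# Hypothetical decoder outputs for a two-token caption over a three-word vocabulary.
step_logits = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
target_ids = [0, 1]
loss = caption_loss(step_logits, target_ids)
```

With smoothing set to 0 the function reduces to the ordinary negative log-likelihood of the ground-truth token.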

Dataset and evaluation metrics
We run our experiments on both the YouCook2 dataset (Zhou et al., 2018a) and the ActivityNet Captions dataset (Krishna et al., 2017). YouCook2 is a task-oriented instructional video dataset for procedural video captioning in the recipe domain. We follow the data partition in VideoBERT (Sun et al., 2019b), which uses 457 videos from the YouCook2 validation set as the test set and the rest for development. In all, we use 1,278 videos for training and validation. We extract the visual features with the S3D model pre-trained on the Howto100M dataset (Miech et al., 2019) with the MIL-NCE method (Miech et al., 2020). This visual representation is well suited to how-to videos. The ASR transcript is extracted automatically with an off-the-shelf recognition tool.
Unlike YouCook2, which has non-overlapping event proposals, ActivityNet Captions consists of open-domain videos with overlapping proposals. We apply the same data partition as (Iashin and Rahtu, 2020b) with the ground-truth labels. We directly download the copy of the dataset from (Iashin and Rahtu, 2020b), which contains 9,167 (out of 10,009) training and 4,483 (out of 4,917) validation videos. The dataset contains only the videos that are still available (91%), because some YouTube links no longer work. To make a fair comparison, we only list experimental results obtained on this same dataset. This open-source code and data portal also contains the speech content extracted from the closed captions (CC) of the YouTube ASR system.

Implementation details
We develop our model based on the open-source code of MDVC (Iashin and Rahtu, 2020b) (https://github.com/v-iashin/MDVC), and will release our code later. The embedding size of video, the hidden size of the multi-head attention, and the size of the feed-forward layer are 1024, 512, and 128 respectively. The number of attention heads is 8 and the dropout rate is 0.4. We set the local context length $k_l$ to 10, that is, the 10 previous and 10 future frames serve as the local event-agnostic context, and use one previous event and one next event as the global event-aware context as a trade-off between performance and efficiency. We adopt the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-4 and momentum parameters $\beta_1 = 0.9$ and $\beta_2 = 0.98$. For label smoothing, the smoothing rate is 0.4. We set the batch size to 128. In terms of model complexity, the HCN model adds only 3% more parameters to the base model. All models are trained on a single Tesla P100 GPU, for 4 hours on YouCook2 and 30 hours on ActivityNet Captions.
Video features. We sampled frames at 16 fps, took the feature activations before the final linear classifier of the S3D backbone, and applied 3D average pooling to obtain a 1024-dimensional feature vector. We obtained 1 feature per second and set $k$ to 80.

Comparison with State-of-the-art Results
We report the results of our context-aware model on the YouCook2 dataset in Table 3. The existing baseline models are: (1) Bi-LSTM with Temporal Attention (Bi-LSTM + TempoAttn) (Shou et al., 2016), which adopts a Bi-LSTM language encoder; (2) End-to-End Masked Transformer (EMT) (Zhou et al., 2018b), a transformer-based model; (3, 4) VideoBERT (Sun et al., 2019b) and Contrastive Bidirectional Transformer (CBT) (Sun et al., 2019a), which are pre-training based methods; (5) AT+Video (Hessel et al., 2019), a multi-modal transformer method. Apart from (Shou et al., 2016), which uses a recurrent network, all baseline methods adopt the transformer model. Our context-aware model achieves the best results for uni-modal video event captioning and outperforms the context-agnostic base model by a large margin. Furthermore, our HCN model with multi-modal inputs achieves results comparable with the state of the art.
We list experimental results on the partial ActivityNet Captions dataset, following (Iashin and Rahtu, 2020b), and omit results reported on the full dataset, as in (Krishna et al., 2017), to make a fair comparison. Table 2 presents the results of the baseline methods and HCN. The baseline methods are: (1) WLT (Rahman et al., 2019), a weakly supervised method with multi-modal input; (2) multi-modal video event captioning (MDVC) (Iashin and Rahtu, 2020b), a transformer-based model with multi-modal inputs; (3) BMT (Iashin and Rahtu, 2020a), which makes better use of visual-audio information. Among these methods, WLT encodes the context with a recurrent network, while the others are transformer models. HCN outperforms the base context-agnostic methods by a large margin and achieves state-of-the-art results.
From both sets of experimental results, we can see that adding context-aware information improves the base context-agnostic model by a large margin for both uni-modal and multi-modal input.
We conduct an ablation study of the HCN model on the YouCook2 dataset. In this experiment, we use uni-modal input and report the ablation results in Table 3. We remove one component at a time from the full HCN model and compare the performance.
Type embedding: we remove the type embedding, which distinguishes whether the input is a source or a context event. The results show that performance drops when the type embedding is removed.
Past/Future context: we evaluate the model with only the past context or only the future context and find that both are effective and complementary. The model with context in both directions achieves the best result.
Cross attention gate: the GRU gate in the cross attention model is more effective than a simple combination, which suggests that the GRU gate is better at modeling a sequential context.
Local/global context: from the results in Table 3, we can see that the global context is more effective than the local context. The HCN model with both contexts outperforms all the other variants.
Context length: 1) For the local context, the results with 10 or 20 context frames are similar, with CIDEr of 141.1 and 141.3 respectively, while performance with 40 frames drops to a CIDEr of 138. 2) For the global context, we increased the number of previous and next events used as global context, but saw no further improvement. We found that irrelevant events can even introduce noise or duplicated information.

Qualitative Analysis
We analyzed several cases and found two interesting videos, shown in Figures 4 and 5. We depict the visual thumbnails, the ground-truth captions, and the predictions of our baseline and HCN methods.
Figure 4: In this case, it is hard to distinguish the fine-grained object "chicken" or "pork" from either the visual input or the transcript (co-reference "it"). The baseline method tends to predict "chicken" due to a prior bias for the ambiguous object, leading to inconsistent captions between events. Modeling event dependency makes the captions coherent. Besides, as shown in event 1, our HCN model can leverage local context to learn the entity "pork" from previous frames.
From the case in Figure 4, we can see that the baseline context-agnostic model generates the caption of each event independently, which leads to inconsistent captions. The baseline model predicts the ambiguous object as "chicken" for event 1 due to prior bias, but outputs "pork" for event 2. Our HCN model can tackle this issue and tends to predict captions with a consistent object throughout the procedure. Besides, as shown in event 1, the entity "pork" can also be learned from previous frames. The context-aware model is effective in resolving entity ambiguity and generating coherent captions.
The case in Figure 5 presents another challenge. Since the visual cues of the three events are very similar, the base context-agnostic model predicts the same caption, "knead the dough", for all of them. The HCN model can learn the prior dependency between events and avoid generating redundant sentences for similar events in the video. Therefore, the HCN model generates the correct sentence for event 3. However, although the model tries to predict a different caption for event 1, it is still hard to recognize the fine-grained entity "salt" from the video, and all models predict the object incorrectly. Fine-grained entity recognition from video remains a challenging problem.
To sum up, these cases show that:
1) the neighboring context can provide extra information to make an accurate and coherent prediction;
2) the HCN model can capture the temporal dependency between events as prior knowledge and generate consistent and less redundant captions across events;
3) fine-grained object recognition from video is still a challenging problem. Visual coreference resolution (Kottur et al., 2018) is a possible direction for future work on this problem.

Related Work
Video Captioning. The tasks mainly fall into three types of captioning: single-sentence captioning (Wang et al., 2018b), paragraph-level captioning (Yu et al., 2016; Lei et al., 2020; Ging et al., 2020), and event-level captioning (Krishna et al., 2017; Li et al., 2018; Wang et al., 2018a; Mun et al., 2019; Zhou et al., 2018b). The difference between these tasks is whether one or multiple sentences are generated, and whether they describe the whole video or each separate event of the video. In this paper, we focus on the more challenging dense event-level video captioning task, which generates a description for each event. Previous works (Krishna et al., 2017; Li et al., 2018; Wang et al., 2018a) mainly used recurrent neural models such as the long short-term memory network (LSTM) (Hochreiter and Schmidhuber, 1997) or the gated recurrent unit (GRU) (Cho et al., 2014) to encode context. However, recurrent models struggle to model long-range dependencies effectively. Later works (Zhou et al., 2018b; Sun et al., 2019b,a) introduced a self-attention model (Vaswani et al., 2017) that generates the caption based solely on the clip of each event. Compared with these works, we are the first to implement a novel video-level hierarchical context-aware network for dense video event captioning.
Multi-modal Video Captioning. Video naturally has multi-modal inputs including visual, speech text, and audio. Previous works explore visual RGB, motion, and optical flow features, audio features (Hori et al., 2017; Wang et al., 2018b; Rahman et al., 2019), as well as speech text features (Shi et al., 2019; Hessel et al., 2019; Iashin and Rahtu, 2020b) for captioning. According to (Shi et al., 2019; Hessel et al., 2019; Iashin and Rahtu, 2020b), although the speech text is noisy and informal, it can still capture better semantic features and improve performance, especially for instructional videos. Later, Iashin and Rahtu (2020b) proposed to embed visual, audio, and speech text features together for dense video event captioning. However, context-aware models have rarely been investigated in multi-modal video event captioning. Therefore, we propose a novel attention model that effectively encodes the local and global context to tackle ambiguous object recognition and transcript co-reference by jointly modeling multi-modal inputs.
Context-aware Language Generation. Our work is inspired by context-aware language generation, e.g. document-level neural machine translation (NMT) (Miculicich et al., 2018; Maruf et al., 2019; Ma et al., 2020). Miculicich et al. (2018) adopted a hierarchical context-aware network in a structured and dynamic manner. Maruf et al. (2019) and Ma et al. (2020) further explored scalable and effective attention mechanisms. For the local neighbor-clip and global inter-event context, we design a hierarchical context-aware network with a hybrid mechanism for multi-modal video captioning that dynamically leverages various kinds of video-level information through a gating scalar.

Conclusion and Discussion
Dense video event captioning is a typical video understanding task to learn procedural events in a long untrimmed video. It is essential to model holistic video information for event understanding. In this paper, we propose a novel hierarchical context-aware network to encode both the local and global context of long videos. Our HCN model is effective in modeling context and outperforms the context-agnostic model by a large margin.
In future work, we intend to extend our hierarchical network and further investigate how to attend to long context effectively while filtering out ambiguous and irrelevant information.