Multi-Scale Progressive Attention Network for Video Question Answering

Understanding the multi-scale visual information in a video is essential for Video Question Answering (VideoQA). We therefore propose a novel Multi-Scale Progressive Attention Network (MSPAN) to perform relational reasoning over cross-scale video information. We construct clips of different lengths to represent different scales of the video. The clip-level features are then aggregated into node features by max-pooling, and a graph is generated for each scale of clips. For cross-scale feature interaction, we design a message passing strategy between graphs of adjacent scales, i.e., top-down scale interaction and bottom-up scale interaction. Guided by the question, progressive attention then fuses the video features from all scales. Experiments on three benchmarks (TGIF-QA, MSVD-QA and MSRVTT-QA) show that our method achieves state-of-the-art performance.


Introduction
Video Question Answering (VideoQA) is a popular vision-language task, which focuses on predicting the correct answer to a given natural language question based on the corresponding video. The VideoQA task entails representing video features in both the spatial and temporal dimensions. Compared with the visual features of a single image in Visual Question Answering, video features require a more complex attention mechanism.
Jang et al. (2017) therefore employed appearance features and motion features as the video representation, and designed a dual-LSTM network based on spatio-temporal attention to fuse visual and text information. Memory networks were subsequently widely used to capture long-term dependencies. For example, Cai et al. (2020) applied feature augmented memory to strengthen the information augmentation of video and text. Complex relational reasoning is also important for the VideoQA task. Consequently, a conditional relationship network (Le et al., 2020) was designed in previous work, which can support high-order relationships and multi-step reasoning. Many methods address this task from one particular aspect, but none of them achieves a fine-grained understanding of the video information. When looking for the answer to a question in a video, the frame spans corresponding to the different objects mentioned in the question differ in length. As shown in Fig. 1, when asked "who is cleaning in a kitchen while wearing gloves?", we need to find the keywords "cleaning", "a kitchen" and "wearing gloves" in clips of different levels. Previous methods searched for the answer on a single level of clips in a video, leading to insufficient or redundant information.
[Fig. 1: example video shown along the time axis with one-frame, two-frame and three-frame clips. Question: who is cleaning in a kitchen while wearing gloves? Answer: woman.]
To address this, we first construct clips of different lengths from the frame sequence, and regard the length of a clip as its scale. Multi-scale graphs are then generated separately for clips of different scales. The nodes in the multi-scale graphs represent the video features of the corresponding clips. To implement relational reasoning, the nodes in each scale graph are first updated by graph convolution. Most importantly, under the guidance of the question, progressive attention is used to fuse multi-scale features during cross-scale graph interaction. In detail, each graph is gradually updated in top-down scale order, and then each graph is updated in bottom-up scale order. Finally, the node features of a graph are fused with the question embedding, and a classifier is employed to find the answer.

Method
An overview of the proposed MSPAN is shown in Fig. 2. The input is a short video and a question sentence, while the output is the produced answer.

Video and Question Representation
Video representation. N frames are uniformly sampled to represent the video. We use a pre-trained ResNet-152 (He et al., 2016) to extract appearance features for each frame, and a 3D ResNet-152 (Hara et al., 2018) pre-trained on the Kinetics-700 dataset (Carreira et al., 2019) to extract motion features. Specifically, the 16 frames around each sampled frame are fed into the 3D ResNet-152 to obtain the motion features around that frame. Finally, we obtain a joint video representation by concatenating the appearance features and motion features, and a fully-connected layer reduces the feature dimension, giving one feature vector per sampled frame.
Question representation. All words in the question are represented as 300-dimensional embeddings initialized with pre-trained GloVe vectors (Pennington et al., 2014). A 512-dimensional question embedding is then generated from the last hidden state of a three-layer BiLSTM, i.e., $q \in \mathbb{R}^{512}$.

Multi-Scale Graphs Generation
Each object in the video corresponds to a different number of frames, but previous methods (Seo et al., 2020; Lei et al., 2021) cannot treat the various levels of visual information separately. We therefore construct clips of different lengths to express the visual information in the video at a fine granularity, and regard the length of a clip as its scale.
We use max-pooling with different kernel sizes to aggregate frame-level visual features; the kernel size is the scale attribute of the resulting clips. Writing $v_j$ for the feature of the j-th frame, the clip-level visual features are obtained as follows:
$$v_{i,j} = \mathrm{MaxPool}\left(v_j, v_{j+1}, \ldots, v_{j+i-1}\right), \quad 1 \le i \le K,$$
where K is the range of scales, and $K \le N$. Thus, we construct $M_i = N - i + 1$ clips at scale i:
$$V_i = \{\, v_{i,j} : 1 \le j \le M_i \,\}.$$
In order to reason about the relationships between different objects in a video, we build a separate graph for each scale. Each node in a graph represents the visual features of one clip. An edge connects two nodes only when their clips contain overlapping or adjacent frames. The frame interval of the j-th clip at scale i is $[j, j + i - 1]$, so all edges in the K graphs can be expressed as:
$$E_i = \{\, (v_{i,j}, v_{i,k}) : 0 < |j - k| \le i \,\}.$$
Finally, the multi-scale graphs constructed in this paper can be denoted as
$$G = \{\, G_i = (V_i, E_i) : 1 \le i \le K \,\}.$$
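The construction above can be illustrated with a minimal pure-Python sketch. It assumes a stride of 1 (implied by $M_i = N - i + 1$), 0-based clip indices, and an undirected edge list without self-loops; the function name `build_multiscale_graphs` and the list-of-lists feature format are hypothetical choices for illustration, not the authors' implementation.

```python
def build_multiscale_graphs(frames, K):
    """Build one graph per scale from N frame-level feature vectors.

    frames: list of N feature vectors (each a list of floats).
    Returns a dict mapping scale i (1..K) to (nodes, edges).
    Assumption: stride-1 max-pooling and 0-based clip indices.
    """
    N = len(frames)
    if not 1 <= K <= N:
        raise ValueError("K must satisfy 1 <= K <= N")
    graphs = {}
    for i in range(1, K + 1):
        M = N - i + 1  # number of clips at scale i
        # Element-wise max over i consecutive frames (max-pool, kernel size i).
        nodes = [[max(col) for col in zip(*frames[j:j + i])] for j in range(M)]
        # Clip j covers frames [j, j + i - 1]; clips j < k overlap or are
        # adjacent exactly when k - j <= i.
        edges = [(j, k) for j in range(M) for k in range(j + 1, M) if k - j <= i]
        graphs[i] = (nodes, edges)
    return graphs


if __name__ == "__main__":
    toy_frames = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [2.0, 5.0]]
    for scale, (nodes, edges) in build_multiscale_graphs(toy_frames, 3).items():
        print(scale, len(nodes), edges)
```

With N = 16 and K = 8, as in the reported settings, this yields graphs with 16, 15, ..., 9 nodes.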

Cross-Scale Feature Interaction
Before cross-scale feature interaction, the original node features of the K graphs are copied as $V^o_i = V_i$.
Interaction at the same scale.
For all nodes at the same scale, we apply a two-layer graph convolutional network (GCN) (Kipf and Welling, 2017) to perform relational reasoning over the K graphs. The graph convolution is represented as:
$$X^{l+1} = \sigma\left(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} X^{l} W^{l}\right),$$
where $\hat{A}$ is the input adjacency matrix, $X^l$ is the node feature matrix of layer l, $W^l$ is the learnable weight matrix, and $\sigma$ is the nonlinear activation of the layer. The diagonal node degree matrix $\hat{D}$ is used to normalize $\hat{A}$. Because each graph has only a small number of nodes, we share the weight matrix $W^l$ across the K graphs.
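As a rough illustration of the same-scale update, the following pure-Python sketch implements a two-layer GCN in the style of Kipf and Welling (2017). It assumes that self-loops are added to the adjacency matrix before symmetric normalization and that ReLU is the activation; neither choice is stated in the text, and the helper names are hypothetical.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square 0/1 adjacency matrix.

    Assumption: adj has no self-loops; they are added here, following
    the standard Kipf and Welling formulation.
    """
    n = len(adj)
    a_hat = [[adj[r][c] + (1.0 if r == c else 0.0) for c in range(n)]
             for r in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[r][c] / math.sqrt(deg[r] * deg[c]) for c in range(n)]
            for r in range(n)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(a_norm, x, w):
    """One graph convolution: ReLU(a_norm @ x @ w)."""
    return [[max(0.0, v) for v in row] for row in matmul(matmul(a_norm, x), w)]


def two_layer_gcn(adj, x, w0, w1):
    """Apply two GCN layers; the same (w0, w1) can be reused for every scale."""
    a_norm = normalize_adjacency(adj)
    return gcn_layer(a_norm, gcn_layer(a_norm, x, w0), w1)
```

Sharing the weights across scales, as described above, simply means calling `two_layer_gcn` with the same `w0` and `w1` for each of the K graphs.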
Interaction at top-down scale.
We realize the interaction between graphs of adjacent scales from small scale to large scale, so that visual information is understood step by step, from details to the whole. Guided by the question, the nodes in graph $G_i$ are used to update the nodes in graph $G_{i+1}$. Because visual features at different scales attend to the question hierarchically, we call this progressive attention.
If the clip corresponding to node x in graph G i has the same frames as the clip corresponding to node y in graph G i+1 , there will exist a directed edge from x to y. Therefore, we can use the edge to fuse the cross-scale features of these same frames.
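One way to read the cross-scale edge rule is sketched below. It assumes that "the same frames" means every frame of the smaller clip is contained in the larger clip, and it uses 0-based clip indices; the function name `cross_scale_edges` is hypothetical.

```python
def cross_scale_edges(N, i):
    """Directed edges (x, y) from clip x at scale i to clip y at scale i + 1.

    Assumption: an edge exists when all frames of clip x (frames x..x+i-1)
    lie inside clip y (frames y..y+i). Indices are 0-based.
    """
    if not 1 <= i < N:
        raise ValueError("scale i must satisfy 1 <= i < N")
    m_small = N - i + 1  # number of clips at scale i
    m_large = N - i      # number of clips at scale i + 1
    edges = []
    for y in range(m_large):
        large = set(range(y, y + i + 1))
        for x in range(m_small):
            small = set(range(x, x + i))
            if small <= large:  # containment test
                edges.append((x, y))
    return edges


if __name__ == "__main__":
    print(cross_scale_edges(4, 1))
```

Under this reading each node at scale i + 1 receives messages from exactly two nodes at scale i (clips y and y + 1). Reversing each pair gives the edges for the opposite direction of interaction.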
First, the visual features and the question embedding are fused to capture the joint features of each node in graph $G_i$. Messages are then passed from graph $G_i$ to graph $G_{i+1}$, where $\otimes$ denotes the outer product and $\odot$ the Hadamard product. After the messages are received, an attention weight is computed for each of them over $N_y$, the set of all neighbor nodes of y in graph $G_i$ reached through cross-scale edges. The attention-weighted messages passed into node y are then summed to derive the update of node y. When updating all nodes in graph $G_{i+1}$, we consider both the new features $V^u_{i+1}$ and the original features $V^o_{i+1}$: we use a residual connection, in which $[\,;\,]$ denotes concatenation, to preserve the original information of the video. The learnable weights $W_1 \sim W_6$ used in these steps are shared in the update of graphs $G_2 \sim G_K$. To summarize, the update of these K − 1 graphs is a progressive process from small scale to large scale, hence it is referred to as top-down scale interaction.
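The aggregation pattern described here (attention weights over incoming messages, a weighted sum, then concatenation with the original features) can be sketched generically as follows. This is a simplified stand-in: the attention scores are taken as given inputs, the learned projections $W_1 \sim W_6$ are omitted, and softmax normalization is an assumption, so it should not be read as the exact update used in the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_messages(messages, scores):
    """Sum message vectors weighted by softmax-normalized scores."""
    weights = softmax(scores)
    dim = len(messages[0])
    return [sum(w * m[d] for w, m in zip(weights, messages)) for d in range(dim)]


def update_node(original, messages, scores):
    """Concatenate [aggregated update ; original features].

    The learned projection that would follow the concatenation is omitted.
    """
    return aggregate_messages(messages, scores) + list(original)
```

In a full model the scores would themselves depend on the question embedding, which is what makes the attention question-guided.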
Interaction at bottom-up scale.
After gaining an overall understanding of a video, people can accurately find all details related to the question the second time they watch it. We therefore achieve an understanding of the video from global to local through bottom-up scale interaction. Following the top-down interaction, we realize the interaction between graphs of adjacent scales from large scale to small scale.
Following the same method as the top-down scale interaction (Eq. 6 to Eq. 10), we apply graph $G_i$ to update graph $G_{i-1}$ in this interaction. However, a separate set of weights $W_1 \sim W_6$ is used for the update of graphs $G_{K-1} \sim G_1$. After this interaction, graph $G_1$ captures the all-scale video features related to the question through progressive attention.
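The order in which graphs update one another can be made explicit with a small sketch. It lists only the cross-scale updates of one or more iterations (a top-down pass followed by a bottom-up pass) and leaves out where the same-scale GCN is applied, which the text does not pin down; the function name `interaction_schedule` is hypothetical.

```python
def interaction_schedule(K, T):
    """Return the ordered list of (source_scale, target_scale) updates.

    Each of the T iterations runs a top-down pass (G_i updates G_{i+1})
    followed by a bottom-up pass (G_i updates G_{i-1}).
    """
    steps = []
    for _ in range(T):
        steps += [(i, i + 1) for i in range(1, K)]      # small to large scale
        steps += [(i, i - 1) for i in range(K, 1, -1)]  # large to small scale
    return steps


if __name__ == "__main__":
    print(interaction_schedule(3, 1))
```

Every iteration ends with an update of $G_1$, which is consistent with reading out the nodes of $G_1$ afterwards.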

Multimodal Fusion and Answer Decoder
After T iterations of cross-scale feature interaction, we read out all the nodes in graph $G_1$. A simple attention is then used to aggregate the N nodes and produce the final multi-modal representation F, where ELU is the activation function, $W_7 \sim W_{11}$ are learnable weights and b is a learnable bias. We find the answer by applying a classifier (two fully-connected layers) to the multi-modal representation F. For open-ended tasks, a multi-label classifier is applied and the cross-entropy loss is used to train the model. Because repetition count is a regression task, we use the MSE loss for it. For the multi-choice task, each question corresponds to R answer sentences. We first obtain the embedding of each answer in the same way as the question embedding. We then use the multi-modal fusion method in Eq. 11∼13 to fuse the answer embedding with the node features, and two fully-connected layers produce the answer scores. This model is trained by minimizing the hinge loss (Jang et al., 2017).
MSVD-QA (Xu et al., 2017) and MSRVTT-QA (Xu et al., 2016) are open-ended tasks generated from video descriptions. In both datasets, questions can be divided into 5 types according to their question words: what, who, how, when and where.
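For the multi-choice setting, the hinge loss over the R candidate answer scores can be sketched as below. The pairwise form with a margin of 1 is an assumption based on common practice for this loss; the text only states that a hinge loss is minimized, and the function names are hypothetical.

```python
def multi_choice_hinge_loss(scores, correct_index, margin=1.0):
    """Pairwise hinge loss between the correct answer and each wrong answer.

    scores: list of R answer scores for one question.
    Assumption: the loss is sum over wrong answers of
    max(0, margin + s_wrong - s_correct), with margin 1 by default.
    """
    s_pos = scores[correct_index]
    return sum(max(0.0, margin + s - s_pos)
               for k, s in enumerate(scores) if k != correct_index)


def predict_answer(scores):
    """Pick the index of the highest-scoring candidate answer."""
    return max(range(len(scores)), key=scores.__getitem__)
```

The loss is zero once the correct answer outscores every other candidate by at least the margin.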

Implementation Details
We evenly sample N = 16 frames from each video in the three datasets. The hyperparameters in our experiments are set to T = 3 and K = 8. The network is trained with Adam, using an initial learning rate of $10^{-4}$. The batch size is 64 for the TGIF-QA dataset and 128 for both the MSVD-QA and MSRVTT-QA datasets.

Results
We compare our MSPAN with the state-of-the-art methods: PSAC (Li et al., 2019), HME (Fan et al., 2019), FAM (Cai et al., 2020), LGCN (Huang et al., 2020), HGA (Jiang and Han, 2020), QueST and HCRN (Le et al., 2020).
Results on TGIF-QA. As shown in Table 1, our method outperforms the state-of-the-art methods by 2.5% and 1.9% in accuracy on the Action and Transition tasks, respectively. For the Count task, our method also achieves the best Mean Square Error (MSE) of 3.57 among all methods. Because QueST uses multi-dimensional visual features that contain more appearance information, our method only matches its accuracy of 59.7% on the FrameQA task. Overall, our method exploits the multi-scale information of the video, which yields clear gains on tasks related to action recognition, temporal relationships and object counting.
Results on MSVD-QA. As shown in Table 2, our method improves the overall accuracy by 4.2% compared to recent methods. We achieve the best accuracy on questions whose question words are "What", "Who", "When" and "Where". Because "How" questions make up only a small proportion of the data, our accuracy on them is lower than that of other methods.
Results on MSRVTT-QA. As shown in Table 3, our method achieves the best overall accuracy of 37.8%. Moreover, it obtains strong accuracy across the different question words.

Ablation Studies
To explore the potential of our network, ablation experiments are performed on the TGIF-QA dataset. The default hyperparameters are T = 3 and K = 8. We study the effectiveness of our network in the following two aspects, as shown in Table 4 and Fig. 4.

Different Structures
Considering the interaction of cross-scale graphs, three structures are designed, as shown in Fig. 3. For the dense scale in Fig. 3 (a), we apply graphs G 1 ∼ G K to update each graph G i . The other two structures have been introduced in Sec 2.3, and we will not use a graph to update itself for the three structures. The readout of top-down scale interaction is graph G K , and the readout of bottom-up scale interaction is G 1 . However, the readout of dense scale interaction is all K graphs. Our network is a combination of top-down scale interaction and bottom-up scale interaction, but we will use these two structures separately for comparison.

Network Structure
When choosing the pooling function to aggregate the frames in a clip, we find that max-pool is more effective than avg-pool. During backpropagation through max-pool, only the maximum features in the previous layer receive the gradient, so max-pool facilitates the fusion of appearance features and motion features in the previous layer. Our experiments also show that the GCN is beneficial to stable training. Without the GCN, the gradient gradually vanishes as the number of interactions between the graphs increases. The role of the GCN is to recover the features of nodes that have lost their visual information.
As shown in Table 4, all three structures in Fig. 3 perform worse than our full network. Because of the dense connections between graphs of all scales, the dense scale interaction adds much unnecessary computation and makes it difficult to accurately find the visual information related to the question. Although both the top-down scale interaction and the bottom-up scale interaction achieve good performance on their own, the combination of these two structures obtains a more detailed understanding of the video.

Hyperparameters T and K
As the number of iterations T increases, the model achieves better performance. However, when T = 4 the performance decreases, as shown in Table 4, because too many modules introduce noise into answer generation. The model improves clearly as K increases, and the best performance is obtained when K = 8, as shown in Fig. 4. However, a larger K also means more multi-scale graphs, which leads to network instability.

Conclusion
We introduce a multi-scale learning method to achieve a fine-grained understanding of the video. Compared with existing spatio-temporal attention, we use progressive attention to realize cross-scale feature interaction. The top-down and bottom-up structures we have designed are conducive to learning all-scale visual information of the video. For longer videos, we plan to use dilated max-pools with different strides to reduce the size of graphs. In general, we consider the VideoQA task from the perspective of multi-scale information interaction, and the proposed network is instructive.