Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering

Video Question Answering is a task that requires an AI agent to answer questions grounded in video. This task entails three key challenges: (1) understanding the intentions of various questions, (2) capturing various elements of the input video (e.g., object, action, causality), and (3) cross-modal grounding between language and vision information. We propose Motion-Appearance Synergistic Networks (MASN), which embed two cross-modal features grounded in motion and appearance information and selectively utilize them depending on the question's intentions. MASN consists of a motion module, an appearance module, and a motion-appearance fusion module. The motion module computes action-oriented cross-modal joint representations, while the appearance module focuses on the appearance aspect of the input video. Finally, the motion-appearance fusion module takes the outputs of the motion module and the appearance module as input and performs question-guided fusion. As a result, MASN achieves new state-of-the-art performance on the TGIF-QA and MSVD-QA datasets. We also conduct qualitative analysis by visualizing the inference results of MASN.


Introduction
Recently, research in natural language processing and computer vision has made significant progress in artificial intelligence (AI). Thanks to this, vision-language tasks such as image captioning (Xu et al., 2015), visual question answering (VQA) (Antol et al., 2015; Goyal et al., 2017), and visual commonsense reasoning (VCR) (Zellers et al., 2019) have been introduced to the research community, along with some benchmark datasets. In particular, video question answering (video QA) tasks (Xu et al., 2016; Jang et al., 2017; Yu et al., 2019; Choi et al., 2020) have been proposed with the goal of reasoning over higher-level vision-language interactions. In contrast to QA tasks based on static images, the questions presented in video QA datasets vary from frame-level questions regarding the appearance of objects (e.g., what is the color of the hat?) to questions regarding action and causality (e.g., what does the man do after opening a door?).
There are three crucial challenges in video QA: (1) understanding the intentions of various questions, (2) capturing various elements of the input video (e.g., object, action, and causality), and (3) cross-modal grounding between language and vision information. To tackle these challenges, previous studies (Huang et al., 2020) have mainly explored this task by jointly embedding features from a pre-trained word embedding model (Pennington et al., 2014) and object detection models. However, as discussed in Gao et al. (2018), visual features extracted from object detection models are ill-suited for motion analysis, since object detection models lack temporal modeling. To strengthen motion analysis, a few approaches (Xu et al., 2017; Gao et al., 2018) have employed additional visual features (Tran et al., 2015) (i.e., motion features) widely used in the action recognition domain, but their reasoning capability is still limited. They typically employ recurrent models (e.g., LSTMs) to embed a long sequence of visual features. Due to the long-term dependency problem of recurrent models (Bengio et al., 1993), their proposed methods may fail to learn dependencies between distant features.
In this paper, we propose Motion-Appearance Synergistic Networks (MASN) for video question answering, which consist of three kinds of modules: the motion module, the appearance module, and the motion-appearance fusion module. As shown in Figure 1, the motion module and the appearance module aim to embed rich cross-modal representations. These two modules have the same architecture, except that the motion module takes the motion features extracted from I3D as visual features, while the appearance module utilizes the appearance features extracted from ResNet. Each of these modules first constructs object graphs via graph convolutional networks (GCN) to compute the relationships among objects in each visual feature. Then, the vision-question interaction module performs cross-modal grounding between the output of the GCNs and the question features. The motion module and the appearance module thus yield cross-modal representations of the motion and appearance aspects of the input video, respectively. The motion-appearance fusion module finally integrates these two features based on the question features. The main contributions of our paper are as follows. First, we propose Motion-Appearance Synergistic Networks (MASN) for video question answering based on three modules: the motion module, the appearance module, and the motion-appearance fusion module. Second, we validate MASN on the large-scale video question answering datasets TGIF-QA, MSVD-QA, and MSRVTT-QA.
MASN achieves new state-of-the-art performance on TGIF-QA and MSVD-QA. We perform ablation studies to validate the effectiveness of our proposed methods. Finally, we conduct a qualitative analysis of MASN by visualizing inference results.

Related Work
Visual Question Answering (VQA) is a task that requires both understanding questions and finding clues in visual information. VQA can be classified into two categories based on the type of visual source: image QA and video QA. In image QA, earlier works approach the task by applying attention between the question and the spatial dimensions of the image (Yang et al., 2016; Anderson et al., 2018; Kim et al., 2018a; Kang et al., 2019). In video QA, since a video is represented as a sequence of images over time, recognizing the movement of objects or causality in the temporal dimension must also be considered along with the details of the spatial dimension (Jang et al., 2017). There have been attempts (Xu et al., 2017; Gao et al., 2018) to extract motion and appearance features and integrate them on a spatio-temporal dimension via memory networks. Li et al. (2019) and Huang et al. (2020) proposed better-performing models using attention in order to overcome the long-range dependency problem in memory networks. However, they do not represent motion information sufficiently, since they only use features pre-trained on image or object classification. To better address this, we model spatio-temporal reasoning over multiple kinds of visual information (i.e., ResNet and I3D features) while also solving the long-range dependency problem that occurred in previous studies.

Action Classification is the task of recognizing actions, which are composed of interactions between actors and objects. This task therefore has much in common with video QA, in that the model should perform spatio-temporal reasoning. For better spatio-temporal reasoning, Tran et al. (2015) introduced C3D, which extends 2D CNN filters to the temporal dimension. Carreira and Zisserman (2017) proposed I3D, which integrates 3D convolutions into a state-of-the-art 2D CNN architecture and now acts as a baseline in action classification tasks (Murray et al., 2012; Girdhar et al., 2018). Feichtenhofer et al. (2019) introduced SlowFast, a network which encodes images in two streams with different frame rates and temporal resolutions of convolution. This two-stream architecture inspired us in terms of assigning different inputs to each encoder module. However, our method differs from the former studies in two aspects: (1) we utilize language features as well as vision features, and (2) we expand the two-stream structure to solve more than motion-oriented tasks.

Attention Mechanism explicitly calculates the correlation between two features (Bahdanau et al., 2015; Lin et al., 2017) and has been widely used in a variety of fields. For machine translation, the Transformer architecture, first introduced by Vaswani et al. (2017), utilizes multi-head self-attention to capture diverse aspects of the input features (Voita et al., 2019). For video QA, Kim et al. (2018b) and Li et al. (2019) use self- and guided-attention to encode temporal dynamics in video and ground them in the question. For multi-modal alignment, Tsai et al. (2019) apply the Transformer to merge cross-modal time series between vision, language, and audio features. We utilize the attention mechanism to capture various relations between appearance and motion and to aggregate them.

Model
In this section, we give a detailed description of our MASN network. First, we explain how to obtain appearance and motion features in Section 3.1. Then, in Section 3.2, we describe the Appearance and Motion modules, which encode visual features and combine them with the question. Finally, the Motion-Appearance Fusion module modulates the amount of motion and appearance information utilized and integrates them based on the question context.

Visual and Linguistic Representation
We first extract appearance and motion features from the video frames. For the appearance representation, we use ResNet pre-trained on an object and attribute classification task as a feature extractor. For the motion representation, we use I3D (Carreira and Zisserman, 2017) pre-trained on an action classification task. For both the appearance and motion features, we obtain local features representing object-level information without background noise and global features representing each frame's context.
Appearance Representation. For local features, given a video containing T frames, we obtain N objects from each frame using Faster R-CNN (Ren et al., 2016), which applies RoIAlign to extract regions of interest from ResNet's convolutional layers. We denote the appearance-object set as R^a = {o^a_{t,n}, b_{t,n}} for t = 1..T and n = 1..N, where o and b indicate the object feature and bounding box location, respectively. There are thus K = N × T objects in a single video. Following previous works (Huang et al., 2020), we extract the feature map from ResNet-152's Conv5 layer and apply a linear projection. We denote the global features as v^a_global ∈ R^{T×d}, where d is the size of the hidden dimension.
Motion Representation. We obtain a feature map from the last convolutional layer of I3D (Carreira and Zisserman, 2017), whose dimension is (time, width, height, feature) = (T/8, 7, 7, 2048). That is, each set of 8 frames is represented as a single feature map with dimension 7 × 7 × 2048. For local features, we apply RoIAlign (He et al., 2017) to this feature map using the object bounding box locations b. We define the motion-object set as R^m = {o^m_{t,n}, b_{t,n}} for t = 1..T and n = 1..N. We apply average pooling on the feature map and a linear projection to obtain the global features v^m_global ∈ R^{T×d}.
Location Encoding. To reason about relations between objects as in Section 3.2, each object's spatial and temporal location must be considered. As the appearance and motion features share identical operations until the Motion-Appearance Fusion module, we use the shared superscript a/m for simplicity. Following L-GCN (Huang et al., 2020), we add a location encoding to define the local features, where d_s = FFN(b) is the spatial encoding and d_t is obtained by position encoding according to each frame's index.
Here o^{a/m} denotes the object features mentioned above, while FFN denotes a feed-forward network. Analogous to the local features, the position encoding d_t is added to the global features as well.
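The location encoding above can be sketched as follows. Since the paper's equation is not reproduced here, this is a minimal NumPy sketch that assumes the spatial and temporal encodings are simply added to the object feature, with a single linear layer standing in for the FFN over the bounding box and a sinusoidal frame-index encoding for d_t; `W_s` is a hypothetical parameter name.

```python
import numpy as np

def sinusoidal_pe(t, d):
    """Sinusoidal position encoding for frame index t (Transformer-style)."""
    pos = np.arange(d // 2)
    angles = t / np.power(10000.0, 2.0 * pos / d)
    pe = np.zeros(d)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

def encode_location(obj_feat, bbox, frame_idx, W_s):
    """Add a spatial encoding d_s = FFN(b) (here a single linear map of the
    bounding box) and a temporal encoding d_t (frame-index position encoding)
    to an object feature vector."""
    d_s = bbox @ W_s                                  # (d,) spatial encoding
    d_t = sinusoidal_pe(frame_idx, obj_feat.shape[0])  # (d,) temporal encoding
    return obj_feat + d_s + d_t
```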
We then concatenate the object features with the global features to reflect the frame-level context in objects and obtain the visual features.
Linguistic Representation. We apply pre-trained GloVe embeddings to convert each question word into a 300-dimensional vector, following previous work (Jang et al., 2017). To represent the contextual information in a sentence, we feed the word representations into a bidirectional LSTM (bi-LSTM). The word-level features and the last hidden units of the bi-LSTM are denoted by F^q ∈ R^{L×d} and q ∈ R^d, respectively, where L denotes the number of words in the question.

Motion and Appearance Module
In this section, we explain the modules that generate high-level visual representations and integrate them with linguistic representations. Each module consists of (1) an Object Graph, which performs spatio-temporal reasoning between object-level visual features, and (2) VQ interaction, which calculates correlations between objects and words and obtains cross-modal feature embeddings. Since the modules share the same architecture, we describe each module's components only once with a shared superscript to avoid redundancy.

Object Graph Construction
In this section, we define object graphs G^{a/m} = (V^{a/m}, E^{a/m}) to capture spatio-temporal relations between objects, where V and E denote the node and edge sets of the graph. Since Equation 2 provides the visual features v^{a/m}, we define these as the graph input X^{a/m} ∈ R^{K×d}. The nodes of graph G^{a/m} are given by v^{a/m}_i ∈ X^{a/m}, and the edges are given by pairs (v^{a/m}_i, v^{a/m}_j), representing a relationship between the two nodes. Given the constructed graph G, we perform graph convolution (Kipf and Welling, 2016) to obtain relation-aware object features. We obtain the similarity scores of the nodes by calculating the dot product after projecting the input features into an interaction space, and define the adjacency matrix A^{a/m} ∈ R^{K×K} accordingly. We then apply a two-layer graph convolution on the input X with adjacency matrix A, omitting the superscripts for simplicity. We add a skip connection for residual learning between the self-information X and the information smoothed over neighboring objects.
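The graph construction above can be sketched in NumPy. Since the paper's equations are elided in this extraction, this is an illustrative sketch: the adjacency matrix is assumed to be a row-softmaxed dot product of query/key projections (`W_q`, `W_k` are hypothetical parameter names), followed by a standard two-layer graph convolution with a residual connection.

```python
import numpy as np

def row_softmax(S):
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def object_graph(X, W_q, W_k, W_1, W_2):
    """Relation-aware object features over K object nodes.
    The adjacency A is the row-softmaxed dot product of the nodes projected
    into an interaction space; two graph-conv layers follow, and a skip
    connection adds the input X back (residual learning)."""
    A = row_softmax((X @ W_q) @ (X @ W_k).T)  # (K, K) adjacency over objects
    H = np.maximum(A @ X @ W_1, 0.0)          # first graph-conv layer + ReLU
    return A @ H @ W_2 + X                    # second layer + skip connection
```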

Vision-question (VQ) Interaction
We compute both the appearance-question and motion-question interactions to obtain correlations between language and each of the visual features. Having encoded the visual features F^{a/m} and the question features F^q in Equation 4 and Section 3.1, we calculate every pairwise relation between the two modalities using the bilinear operation introduced in BAN (Kim et al., 2018a), where H_0 = F^q, 1 ∈ R^L, 1 ≤ i ≤ g, and A denotes the attention map; F^{a/m} is substituted for V respectively in our method. Calculating the result BAN(H, V; A) ∈ R^d and adding it to H is repeated g times. Afterwards, H represents the combined visual and language features in the question space, incorporating diverse aspects of the two modalities (Yang et al., 2016).
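The glimpse-wise residual update can be sketched as follows. This is not the full low-rank bilinear formulation of BAN (Kim et al., 2018a), whose equations are elided here; it is a simplified stand-in where each glimpse builds an attention map between question words and visual features and adds the attended visual content back onto the question-space feature, i.e., H ← BAN(H, V; A) + H. `U` and `W` are hypothetical projection parameters.

```python
import numpy as np

def row_softmax(S):
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def vq_interaction(H, V, U, W, g=4):
    """g glimpses of a simplified bilinear vision-question interaction.
    H: (L, d) question features; V: (K, d) visual features.
    Each glimpse computes an (L, K) attention map and adds the attended
    visual content back onto H (residual update in the question space)."""
    for _ in range(g):
        A = row_softmax((H @ U) @ (V @ W).T)  # (L, K) attention map
        H = H + A @ (V @ W)                   # residual update
    return H
```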

Motion-Appearance Fusion
In this section, we introduce the Motion-Appearance Fusion module, which is our key contribution. Depending on what the question ultimately asks about, the model should decide which features are more relevant: appearance information, motion information, or a combination of both. To do this, we produce appearance-centered, motion-centered, and all-mixed features and aggregate them depending on the question context. From the previous step, we obtain the cross-modal combined features H^a and H^m for appearance and motion, and concatenate these two matrices to define U.
Motion-Appearance-centered Attention. We first define regular scaled dot-product attention to attend to diverse aspects of the features, where Q, K, and V denote the query, key, and value, respectively. To obtain motion-centered, appearance-centered, and mixed attention, we substitute U for the query, and H^a, H^m, and U for the key and value in Equation 7, where P ∈ R^{2L×d} and Z ∈ R^{2L×d}. As in the first line of Equation 8, we add the projected appearance features P^a to each appearance and motion feature to obtain Z^a, since the matrix U is the concatenation of H^a and H^m. Therefore, we argue that Z^a contains appearance-centered information. Similarly, Z^{m/all} contains motion-centered and all-mixed features, respectively. We argue that Motion-Appearance-centered attention fuses appearance and motion features in various proportions; the three matrices work like multi-head attention, sharing the task of capturing diverse information, and become synergistic when combined.
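The three centered features can be sketched as below. Because Equations 7 and 8 are elided in this extraction, the sketch assumes the residual form Z = U + P described in the text, with U querying each source via standard scaled dot-product attention.

```python
import numpy as np

def attention(Q, K, V):
    """Regular scaled dot-product attention."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(S - S.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def centered_features(H_a, H_m):
    """Appearance-centered, motion-centered, and all-mixed features.
    U = [H_a; H_m] (the concatenation along the sequence axis) acts as the
    query against each source; the attended output P is added back onto U."""
    U = np.concatenate([H_a, H_m], axis=0)  # (2L, d)
    Z_a = U + attention(U, H_a, H_a)        # appearance-centered
    Z_m = U + attention(U, H_m, H_m)        # motion-centered
    Z_all = U + attention(U, U, U)          # all-mixed
    return Z_a, Z_m, Z_all
```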
Question-Guided Fusion. For question-guided fusion, we first define z^{a/m/all} as the sum of the matrix Z^{a/m/all} ∈ R^{2L×d} over the sequence length 2L. We obtain attention scores between each z^{a/m/all} and the question context vector q, where q denotes the last hidden vector of the bi-LSTM. The attention score α^{a/m/all} can be interpreted as the importance of each matrix Z given the question context. We obtain the question-guided fusion matrix O ∈ R^{2L×d} by a linear transformation and a residual connection after the weighted sum. We then aggregate information by attention over the sequence length of O. The final output vector f ∈ R^d is used for answer prediction.
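The fusion steps above can be sketched end-to-end. The exact projection shapes are not specified in the text, so `W_o` and `w_pool` are hypothetical learned parameters, and the question score is assumed to be a dot product followed by a softmax.

```python
import numpy as np

def question_guided_fusion(Z_a, Z_m, Z_all, q, W_o, w_pool):
    """Question-guided fusion sketch. Each Z is summed over its 2L positions,
    scored against the question vector q; the softmax weights alpha mix the
    three matrices; a linear layer with a residual gives O, which is then
    attention-pooled over the sequence into the final vector f."""
    Zs = [Z_a, Z_m, Z_all]
    scores = np.array([Z.sum(axis=0) @ q for Z in Zs])  # importance per view
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    mixed = sum(a * Z for a, Z in zip(alpha, Zs))       # (2L, d) weighted sum
    O = mixed @ W_o + mixed                             # linear + residual
    logits = O @ w_pool                                 # (2L,) pooling scores
    beta = np.exp(logits - logits.max())
    beta /= beta.sum()
    return O.T @ beta                                   # final vector f in R^d
```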

Answer Prediction and Loss Function
The video QA task can be divided into counting, open-ended word, and multiple-choice tasks (Jang et al., 2017). Our method trains the model and predicts the answer for these three task types, similarly to previous work. The counting task is formulated as a linear regression over the final output vector f. We obtain the final answer by rounding the result, and we minimize the Mean Squared Error (MSE) loss.
The open-ended word task is essentially a classification task over the whole answer set. We calculate a classification score by applying a linear classifier and softmax function on the final output f and train the model by minimizing cross-entropy loss.
For the multiple-choice task, as in previous work (Jang et al., 2017), we attach an answer to the question and obtain M candidates. We then obtain a score for each of the M candidates by applying a linear transformation to the output vector f. We minimize the hinge loss over every pair of candidates, max(0, 1 + s_n − s_p), where s_n and s_p are the scores of incorrect and correct answers, respectively.
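The pairwise hinge loss above can be written directly; a minimal sketch, averaging over the M − 1 (incorrect, correct) pairs:

```python
import numpy as np

def pairwise_hinge_loss(scores, correct_idx, margin=1.0):
    """max(0, margin + s_n - s_p) for every incorrect candidate score s_n
    against the correct candidate score s_p, averaged over the M - 1 pairs."""
    s_p = scores[correct_idx]
    losses = [max(0.0, margin + s_n - s_p)
              for i, s_n in enumerate(scores) if i != correct_idx]
    return float(np.mean(losses))
```

The loss is zero whenever the correct candidate beats every incorrect one by at least the margin, which is what drives the candidate scores apart during training.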

Experiments
In this section, we evaluate our proposed model on three Video QA datasets: TGIF-QA, MSVD-QA, and MSRVTT-QA. We first introduce each dataset and compare our results with the state-of-the-art methods. Then, we report ablation studies and include visualizations to show how each module in MASN works.

Datasets
TGIF-QA (Jang et al., 2017) is a large-scale dataset that consists of 165K QA pairs collected from 72K animated GIFs. The video clips are generally very short. TGIF-QA consists of four types of tasks: Count, Action, State Transition (Trans.), and FrameQA. Count is an open-ended task of counting how many times an action is repeated. Action asks the model to find the action repeated a given number of times, and Transition aims to identify a state transition over time; both are multiple-choice tasks. Lastly, FrameQA is an open-ended question that can be solved from just one frame, similar to image QA.
MSVD-QA and MSRVTT-QA (Xu et al., 2017) are automatically generated from video descriptions. They consist of 1,970 video clips and 50K and 243K QA pairs, respectively. The average video lengths are 10 seconds and 15 seconds, respectively. Questions belong to five types: what, who, how, when, and where. The task is open-ended with pre-defined answer sets of size 1,000 and 4,000, respectively.

Implementation Details
We first extract frames at 6 fps for all datasets. For appearance features, we sample one frame out of every four to avoid information redundancy. We apply Faster R-CNN (Ren et al., 2016) pre-trained on Visual Genome (Krishna et al., 2017) to obtain local features; the number of extracted objects is N = 10. For global features, we use ResNet-152 pre-trained on ImageNet (Deng et al., 2009). For motion features, we apply I3D pre-trained on the Kinetics action recognition dataset (Kay et al., 2017). As the input to I3D, we concatenate the set of 8 frames around each sampled frame mentioned above. For training, we employ the Adam optimizer with a learning rate of 10^-4. The number of BAN glimpses g is 4. We set the batch size to 32 for the Count and FrameQA tasks and 16 for the Action and Trans. tasks.

Comparison with State-of-the-arts
We compare MASN with state-of-the-art (SoTA) models on the aforementioned datasets.
TGIF-QA. Compared with ST-VQA (Jang et al., 2017), Co-Mem (Gao et al., 2018), PSAC, STA, HME, and the recent SoTA models HGA, L-GCN, QueST, and HCRN (Jiang and Han, 2020; Huang et al., 2020; Le et al., 2020), MASN shows the best results on three tasks, Count, Trans., and Action, outperforming the baseline methods by a large margin, as shown in Table 1. On FrameQA, the performance is similar to QueST. However, considering that there exists some trade-off between the performance on Count and FrameQA, since Count focuses on identifying temporal information and FrameQA on spatial information, MASN shows the best overall performance across the entire task set.
MSVD-QA & MSRVTT-QA. As shown in Table 2, MASN outperforms the best baseline methods, QueST and HCRN, by approximately 2% on MSVD-QA, and shows competitive results on MSRVTT-QA. Since these datasets are composed of wh-questions, such as what or who, their question sets resemble FrameQA in TGIF-QA, as they tend to focus on spatial appearance features. This suggests that MASN is able to capture spatial details well based on the spatio-temporally mixed features.

Ablation Study
Analyzing the impact of the motion and appearance modules. We investigate the effect of each module shown in Figure 1. In Table 3, the 1st and 2nd rows show the results of using only the Appearance module and only the Motion module, respectively. The 3rd row shows the result of simply concatenating the appearance and motion features from each module and flattening them, by substituting the input X for O in Equation 11. Most existing SoTA models utilize only ResNet features for spatio-temporal reasoning based on the difference of vectors over time; using only the Appearance module is similar to these methods, which can capture spatio-temporal relations relatively well. On the other hand, we found that the accuracy on FrameQA when using only the Motion module is about 7% lower than when using only the Appearance module, which means the Motion module is limited in its ability to capture appearance details. However, comparing the 1st and 3rd rows in Table 3, the performance on the Action and Trans. tasks increases consistently when the Motion module is added, indicating that the Motion module is a meaningful addition. Lastly, comparing the 1st, 2nd, and 3rd rows, integrating the output of both modules yields a further overall performance improvement. This indicates that a synergistic effect occurs when integrating the appearance and motion features after obtaining them as high-level features.
Analyzing the impact of the fusion module. We show ablation studies of the fusion module in Table 3. The 4th row indicates the performance of our proposed MASN architecture. The results in the 'Single-Attention Fusion' rows use only one type of attention among appearance-centered, motion-centered, and all-mixed, as in Equation 8. The results in the 'Dual-Attention Fusion' rows utilize two of these three attention types. Due to the nature of video, when a question such as "How many times does the man in the white shirt put his hand on the head?" is given, the model needs to find the motion information "put" while capturing the appearance information "man in white shirt" or "hand on head", and finally mix them in proportions that depend on the question context. Comparing the 3rd row (without fusion) and MASN first, MASN shows better performance across tasks. This means that mixing appearance and motion features in various proportions via Motion-Appearance-centered attention and computing the weighted fusion via the Question-Guided Fusion module contributes to the performance. Comparing overall performance against the number of attention types in the fusion module, single, dual, and triple attention (MASN) show increasingly better performance, in that order. This indicates that focusing on different aspects and integrating each attended feature performs better than computing attention all at once. Additionally, comparing the use of only appearance-centered or only motion-centered attention in 'Single' with both of them in 'Dual', we find that using both performs better, which means they play complementary roles. Similarly, we attribute the performance increase on FrameQA in the 'Motion' row of 'Single-Att. Fusion' to the model finding relevant appearance information better based on motion information.

Qualitative Results
We give examples of each attention score matrix from the Motion-Appearance Fusion module in Figure 3. We draw two conclusions from the figure: (1) each attention map captures different relations, similarly to multi-head attention, and (2) each attention map is used to a different extent depending on the type of task. For example, in FrameQA, the appearance-centered attention map captures which appearance trait to find, focusing on 'how many'. On the other hand, the motion-centered and all-mixed attention maps attend to 'waving' or 'hands' to capture motion-related information. In Action, similarly, the appearance-centered attention map attends to 'head', which is the object of the action, while the motion-centered attention map captures 'nod', which is related to movement. However, in the Count task, the two attention weight matrices are not as sparse as the scores in the other tasks. We believe this dense attention map causes the inconsistency in the performance increase between the Count task and the Action and Trans. tasks, even though the questions of all three tasks ask for motion information.

Conclusion
In this paper, we proposed Motion-Appearance Synergistic Networks (MASN) to fuse and create a synergy between motion and appearance features. Through the Motion and Appearance modules, MASN finds motion and appearance clues to answer the question, while modulating, through the Fusion module, how much of each type of information is used. Experimental results on three benchmark datasets show the effectiveness of our proposed MASN architecture compared to other models.