Video-guided Machine Translation with Spatial Hierarchical Attention Network

Video-guided machine translation, as one type of multimodal machine translation, aims to use video content as auxiliary information to address the word sense ambiguity problem in machine translation. Previous studies only use features from pretrained action detection models as motion representations of the video to resolve verb sense ambiguity, leaving noun sense ambiguity unresolved. To address this problem, we propose a video-guided machine translation system that uses both spatial and motion representations of videos. For the spatial features, we propose a hierarchical attention network that models the spatial information from the object level to the video level. Experiments on the VATEX dataset show that our system achieves a 35.86 BLEU-4 score, which is 0.51 higher than the single model of the SOTA method.


Introduction
Neural machine translation (NMT) models relying on text data (Bahdanau et al., 2015; Wu et al., 2016) have achieved high performance in domains with little ambiguity in the data, such as the newspaper domain. In other domains, especially real-time domains such as spoken language or sports commentary, verb and noun sense ambiguity largely affects translation quality. To address the ambiguity problem, multimodal machine translation (MMT) incorporates visual data as auxiliary information, where the spatiotemporal contextual information in the visual data helps reduce the ambiguity of nouns or verbs in the source text (Barrault et al., 2018).
Previous MMT studies mainly focus on the image-guided machine translation (IMT) task (Zhao et al., 2020). However, videos are better information sources than images, because a video contains an ordered sequence of frames and thus provides much richer visual features. Specifically, each frame provides spatial representations for noun sense disambiguation, as an image does in the IMT task. Beyond the noun sense disambiguation provided by a single frame, the ordered sequence of frames provides motion representations for verb sense disambiguation.

Figure 1: Source: An apple picker takes apples from the trees and places them in a bin. Translation: 一个苹果苹果从树上摘下苹果，然后把它们放在一个垃圾桶里。(An apple apple takes apples from the trees and places them in a trash bin.)
Research on video-guided machine translation (VMT) starts from a large-scale video-and-language research dataset (VATEX) (Wang et al., 2019). The authors also established a baseline using features from pretrained action detection models as motion representations of the video, which addresses verb sense ambiguity to some extent but leaves noun sense ambiguity unsolved. Hirasawa et al. (2020) aim to solve both the verb and noun sense ambiguity problems by using frame-level action, object, and scene representations. However, without detailed spatial information within a frame and contextual information between frames, the effect on the noun ambiguity problem is limited. For example, as shown in Figure 1, the noun "bin" in English is wrongly translated into "trash bin" in Chinese, whereas it should be translated into "box."

In this work, we propose a VMT system that addresses both the verb and the noun sense ambiguity problems by using both motion and spatial representations of a video. To obtain spatial representations efficiently, we propose to use a hierarchical attention network (HAN) (Werlen et al., 2018) to model the spatial information from the object level to the video level; we therefore call it the spatial HAN module. Additionally, to obtain better contextual spatial information, we add several kinds of middle layers between the object-to-frame layer and the frame-to-video layer of the original HAN. Experiments on the VATEX dataset (Wang et al., 2019) show that our VMT system achieves a 35.86 corpus-level BLEU-4 score on the VATEX test set, a 0.51 improvement over the single model of the SOTA method (Hirasawa et al., 2020).

VMT with Spatial HAN
The overview of the proposed model is presented in Figure 2. It consists of the components of the VMT baseline model (Hirasawa et al., 2020) and our proposed spatial HAN module.

VMT Baseline Model
Hirasawa et al. (2020) proposed a strong VMT baseline model, which consists of the following three modules.

Text Encoder. Each source sentence is represented as a sequence of N word embeddings. A Bi-GRU (Schuster and Paliwal, 1997) encoder then transforms them into text features U = {u_1, u_2, ..., u_N}.

Motion Encoder. The VATEX dataset already provides motion features obtained by a pretrained I3D action recognition model. A Bi-GRU motion encoder first transforms the motion features into motion representations M = {m_1, m_2, ..., m_P}. Then, a positional encoding (PE) layer (Vaswani et al., 2017) encourages the model to use the order of the motion features and obtains the ordered motion representations M*:

  M* = PE(M).

Target Decoder. The text representations U from the source language encoder and the ordered motion representations M* from the motion encoder are processed by two attention mechanisms (Luong et al., 2015):

  r_{u,t} = Attention(U, h_{t-1}),
  r_{m,t} = Attention(M*, h_{t-1}),

where Attention denotes a standard attention block and h_{t-1} denotes the hidden state at the previous decoding time step. The text representation r_{u,t} and the motion representation r_{m,t} are then weighted by another attention layer to obtain the contextual vector r_{c,t} at decoding time step t. The contextual vector is fed into a GRU layer for decoding:

  y_t, h_t = f_gru(r_{c,t}, h_{t-1}),

where f_gru refers to the GRU decoding layer and y denotes the output target word embedding.
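To make the decoding step concrete, the following is a minimal PyTorch sketch of the baseline decoder step under our own illustrative names (LuongAttention, decode_step) and toy dimensions; it is a simplified illustration, not the authors' released implementation:

import torch
from torch import nn

class LuongAttention(nn.Module):
    """General (bilinear) attention of Luong et al. (2015): score = keys . W(query)."""
    def __init__(self, query_dim: int, key_dim: int):
        super().__init__()
        self.w = nn.Linear(query_dim, key_dim, bias=False)
    def forward(self, query, keys):
        # query: (batch, query_dim); keys: (batch, seq_len, key_dim)
        scores = torch.bmm(keys, self.w(query).unsqueeze(-1)).squeeze(-1)  # (batch, seq_len)
        weights = torch.softmax(scores, dim=-1)
        return torch.bmm(weights.unsqueeze(1), keys).squeeze(1)            # (batch, key_dim)

def decode_step(h_prev, text_feats, motion_feats, att_text, att_motion, att_fuse, gru_cell):
    """One decoding step: attend over text and motion features, weight the two
    modality contexts into r_c, and update the decoder GRU state."""
    r_u = att_text(h_prev, text_feats)       # text context r_{u,t}
    r_m = att_motion(h_prev, motion_feats)   # motion context r_{m,t}
    r_c = att_fuse(h_prev, torch.stack([r_u, r_m], dim=1))   # contextual vector r_{c,t}
    h_t = gru_cell(r_c, h_prev)              # feed r_c into the decoding GRU
    return h_t, r_c

# Toy usage: bidirectional encoders give 2 x 512 = 1024-d features, decoder state is 512-d.
batch, enc_dim, dec_dim = 2, 1024, 512
att_text, att_motion, att_fuse = (LuongAttention(dec_dim, enc_dim) for _ in range(3))
gru_cell = nn.GRUCell(enc_dim, dec_dim)
h_t, r_c = decode_step(torch.zeros(batch, dec_dim), torch.randn(batch, 40, enc_dim),
                       torch.randn(batch, 32, enc_dim), att_text, att_motion, att_fuse, gru_cell)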

Spatial HAN
After splitting a video into X frames, we extract Y object-level spatial features from each frame. Because of the effectiveness of the PE layer (Vaswani et al., 2017) in the VMT baseline model, we also apply it to the object-level spatial features; R^i_o denotes the resulting object-level spatial representations of the i-th frame.
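The sinusoidal PE can be sketched as follows; the feature shapes and the choice of what the position index refers to (object order within a frame or the frame index of each object) are illustrative assumptions rather than details stated in the paper:

import math
import torch

def positional_encoding(seq_len: int, dim: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding table of shape (seq_len, dim)."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Example: add positional information to Y = 36 object features of one frame
# (e.g., 2048-d Faster R-CNN region features) before the object-level attention.
object_feats = torch.randn(36, 2048)                    # R_o^i before positional encoding
object_feats = object_feats + positional_encoding(36, 2048)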
Werlen et al. (2018) show that HAN can capture contextual and inter-sentence connections for translation. We propose to use HAN to extract contextual spatial information from adjacent frames within one video clip. With some modifications, we call the network spatial HAN.
The overview of spatial HAN is given in Figure 3. The object-level attention layer summarizes the information from all the objects in their respective frames:

  q_{o,t} = l_o(h_{t-1}),
  r^i_{f,t} = Attention(R^i_o, q_{o,t}),

where l_o is a linear layer that produces the query q_{o,t}, and the attention layer transforms the object-level spatial features R^i_o of the i-th frame into its frame-level spatial features r^i_{f,t}. A middle encoding layer f_d then yields the contextual frame-level spatial features R*_{f,t} at time step t:

  R*_{f,t} = f_d({r^1_{f,t}, r^2_{f,t}, ..., r^X_{f,t}}).

The frame-level attention layer then summarizes the representations of all ordered frames into the video-level abstraction r_{v,t}:

  q_{f,t} = l_f(h_{t-1}),
  r_{v,t} = Attention(R*_{f,t}, q_{f,t}),

where l_f is a linear transformation and q_{f,t} is the query for the attention function at time step t.
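A minimal PyTorch sketch of these two attention levels and the middle layer is given below; the dot-product scoring, the single shared feature dimension, and the names are our own simplifications rather than the paper's exact implementation:

import torch
from torch import nn

def attend(query: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
    """Dot-product attention pooling: a weighted sum of the rows of `keys`."""
    weights = torch.softmax(torch.matmul(keys, query), dim=-1)   # (num_keys,)
    return torch.matmul(weights, keys)                           # (dim,)

class SpatialHAN(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.l_o = nn.Linear(dim, dim)                  # query projection for the object level
        self.l_f = nn.Linear(dim, dim)                  # query projection for the frame level
        self.f_d = nn.GRU(dim, dim, batch_first=True)   # middle layer (optional in the ablation)
    def forward(self, objects: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        # objects: (X frames, Y objects, dim); h_prev: (dim,) decoder state at step t-1
        q_o = self.l_o(h_prev)
        # Object-to-frame level: pool the Y objects of each frame into one frame vector.
        frames = torch.stack([attend(q_o, objects[i]) for i in range(objects.size(0))])
        # Middle layer f_d: contextualize the frame vectors across the X frames.
        frames, _ = self.f_d(frames.unsqueeze(0))
        # Frame-to-video level: pool the X frame vectors into the video-level vector r_{v,t}.
        q_f = self.l_f(h_prev)
        return attend(q_f, frames.squeeze(0))

# Usage: 10 frames x 36 objects x 512-d features (after projecting the region features).
r_v = SpatialHAN(dim=512)(torch.randn(10, 36, 512), torch.randn(512))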

Target Decoder with Spatial HAN Features
The target decoder in our system takes three types of inputs: the text representations r_{u,t}, the motion representations r_{m,t}, and the contextual spatial representations r_{v,t} from spatial HAN. The contextual vector r_{c,t} and the decoding GRU layer at each decoding step t become:

  r_{c,t} = Attention({r_{u,t}, r_{m,t}, r_{v,t}}, h_{t-1}),
  y_t, h_t = f_gru(r_{c,t}, h_{t-1}).
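Continuing the baseline sketch above, the extension with the spatial context can be illustrated as follows; att_fuse and gru_cell are the fusion attention and decoder cell from that sketch, and the three contexts are assumed to share a common dimension (e.g., after linear projections):

import torch

def decode_step_with_spatial(h_prev, r_u, r_m, r_v, att_fuse, gru_cell):
    """Weight the text (r_u), motion (r_m), and video-level spatial (r_v) contexts
    into r_c, then update the decoder GRU state, mirroring the equations above."""
    r_c = att_fuse(h_prev, torch.stack([r_u, r_m, r_v], dim=1))   # contextual vector r_{c,t}
    h_t = gru_cell(r_c, h_prev)
    return h_t, r_c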
Experiments

Dataset
The dataset we used for the VMT task is VATEX, which is built on a subset of the action classification benchmark DeepMind Kinetics-600 (Kay et al., 2017). It consists of 25,991 video clips for training, 3,000 for validation, and 6,000 for public testing. Each video clip is accompanied by 5 parallel English-Chinese descriptions for the VMT task. However, the VATEX dataset only contains the parallel sentences and segment-level motion features. To extract spatial features, we re-collected 23,707 video clips for training, 2,702 for validation, and 5,461 for public testing; about 10% of the clips are no longer available on the Internet. Consequently, spatial features are missing for roughly 10% of the dataset, which puts our method at an inherent disadvantage in the comparison.

Settings
We directly used the implementation of Hirasawa et al. (2020) as our VMT baseline model. For the settings shared by our proposed approach and the VMT baseline model, we set the maximum sentence length to 40 and the word embedding size to 1,024; both the text encoder and the motion encoder were 2-layer Bi-GRUs with a hidden dimension of 512. For our proposed spatial HAN, we used Faster R-CNN (Anderson et al., 2017) to extract the object-level features used as input. The hidden dimensions of both the object-level and the frame-level attention layers were 512. As for the middle layer f_d in spatial HAN, we examined a GRU and an LSTM with a hidden dimension of 512, as well as spatial HAN without the middle layer. The target decoder was a 2-layer GRU with a hidden dimension of 512. During training, we used the Adam optimizer with a learning rate of 0.001 and early stopping with a patience of 10. The vocabulary, provided by Hirasawa et al. (2020), contains lower-cased English tokens and character-level Chinese tokens that occur more than five times in the training set; its sizes are 7,949 for English and 2,655 for Chinese. We adopt corpus-level BLEU-4 as the evaluation metric. We report the score of the VMT baseline model as "VMT baseline: Text+Motion," indicating that it uses both the text and motion encoders. Besides the experiments with text, motion, and spatial features obtained by our method, denoted as "Ours: Text+Motion+Spatial," we also conducted experiments with only text and spatial features, denoted as "Ours: Text+Spatial."

Results
The best single model of Hirasawa et al. (2020) is the model with the text corpus and action features. Because of some differences in hyperparameter settings, our VMT baseline achieves a 0.24 BLEU score improvement over that best single model. Table 2 shows the ablation study on different choices of the middle layer. Without the middle layer, both models achieved the best validation score. The reason may be that the PE layer for the object-level spatial features already provides the contextual information, making the middle layer in spatial HAN dispensable. We notice that our models achieve comparable BLEU scores with and without motion features. We assume that this may come from the misalignment between the motion, spatial, and text features, where the nouns and verbs in the sentences are not strictly aligned to the spatial and motion features. Also, nouns are much more frequent than verbs in the sentences, e.g., the ratios of nouns and verbs in the source training corpus are 0.29 and 0.17, so the spatial features take on a larger role.
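For reference, the corpus-level BLEU-4 metric used above can be computed with the sacrebleu library as in the sketch below; the file names are placeholders, and we do not claim this is the exact evaluation script used in this work:

import sacrebleu

with open("test.hyp.zh", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]     # one system output per line
with open("test.ref.zh", encoding="utf-8") as f:
    references = [line.strip() for line in f]     # references aligned line by line

# tokenize="zh" applies sacrebleu's Chinese tokenization, consistent with the
# character-level Chinese vocabulary described in the settings above.
bleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize="zh")
print(f"corpus-level BLEU-4: {bleu.score:.2f}")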

We further conducted a pairwise human evaluation to investigate how our proposed method improves the translation. Results on 50 random samples show that our model produces 12 better translations than the VMT baseline model, mainly on noun sense disambiguation, while the VMT baseline model produces 6 better translations, mainly on verb sense disambiguation and syntax. This suggests that our model can alleviate the noun sense ambiguity problem. The analysis details of several examples are given in Figure 4.

Conclusion
In this work, we proposed a VMT system with spatial HAN, which achieved a 0.51 BLEU score improvement over the single model of the SOTA method. The results also showed the effectiveness of spatial features for noun sense disambiguation. Our future work will focus on the alignment between text, motion, and spatial representations.