MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering

In Visual Question Answering (VQA), existing bilinear methods focus on the interaction between images and questions; as a result, the answers are either spliced into the questions or used only as classification labels. On the other hand, trilinear models such as the CTI model of Do et al. (2019) efficiently utilize the inter-modality information between answers, questions, and images, while ignoring intra-modality information. Inspired by these observations, we propose a new trilinear interaction framework called MIRTT (Learning Multimodal Interaction Representations from Trilinear Transformers), incorporating attention mechanisms that capture inter-modality and intra-modality relationships. Moreover, we design a two-stage workflow in which a bilinear model reduces the free-form, open-ended VQA problem to a multiple-choice VQA problem. Furthermore, to obtain accurate and generic multimodal representations, we pre-train MIRTT with masked language prediction. Our method achieves state-of-the-art performance on the Visual7W Telling task and the VQA-1.0 Multiple Choice task and outperforms bilinear baselines on the VQA-2.0, TDIUC, and GQA datasets.


Introduction
One key challenge for building robust artificial intelligence systems is to handle information that lies across multimedia data. Visual Question Answering (VQA) (Wu et al., 2017) is a specific example of this challenge, where, given a natural language question about an accompanying image, the system is required to produce a correct answer. This is a typical multimodal problem, since the intelligence system needs to understand images and texts simultaneously.

* These authors contributed equally to this work and should be considered co-first authors.
† Corresponding author.
From the perspective of a single modality, there are plenty of backbone methods for learning better representations of either language or vision. For learning language representations, researchers have developed several pre-trained models, such as GPT-2 (Radford et al., 2019), BERT (Devlin et al., 2019), and RoBERTa (Liu et al., 2019). These approaches learn universal language representations from large-scale corpora, which benefits downstream tasks (Qiu et al., 2020). Concerning visual representations, backbone networks such as Faster R-CNN (Ren et al., 2017) and VGGNet (Simonyan and Zisserman, 2015) have been widely applied to extract image features. Despite the success of these single-modality works, learning the relationships between different modalities remains an unsolved problem.
Existing VQA approaches focus on modeling the relationship between visual and language features with bilinear models. Through bilinear feature fusion, the image and text representations are projected into a unified higher-dimensional space; for example, Multimodal Compact Bilinear pooling (MCB) (Fukui et al., 2016) processes the vectors in Fast Fourier Transform (FFT) space. For better cross-modality information exchange, Yu et al. (2019) and Tan and Bansal (2019) utilize co-attention/cross-attention networks to capture high-level fusion features. Generally, these bilinear approaches only consider how to learn joint representations of the questions and the images, while the answers are processed as labels, making VQA a multi-class classification task (Tan and Bansal, 2019; Zhu et al., 2016; Lu et al., 2019).
However, the answers contain semantic information related to the question and the visual context.
To exploit the answer information, trilinear models, represented by Compact Trilinear Interaction (CTI) (Do et al., 2019), are designed to learn the alignment relationships between the visual context, the answers, and the questions. Unfortunately, the trilinear interaction in CTI only considers the inter-modality relationships but ignores the intra-modality information, leading to unsatisfactory inference results.
To tackle the above problems in the context of VQA, we propose a new trilinear interaction framework called MIRTT (Learning Multimodal Interaction Representations from Trilinear Transformers). Specifically, MIRTT extracts refined high-level feature information from the inter-modality and intra-modality relationships by introducing interactive attention networks across the three modalities and three self-attention networks within each single modality. In general, MIRTT can accommodate the requirements of processing three different modal features and efficiently utilize the information from the answers.
The contributions of our work are as follows:
• By considering the inter-modality and intra-modality relationships, we introduce a new end-to-end trilinear interaction model, MIRTT, that enhances each single-modality representation with the proposed attention networks, resulting in better inference ability in VQA.
• We propose a two-stage workflow that simplifies the harder Free-Form Open-Ended (FFOE) VQA into the simpler Multiple Choice (MC) VQA, which provides a method for solving difficult VQA tasks.
• Our proposed MIRTT achieves state-of-the-art performance on the Visual7W telling task (Zhu et al., 2016) and VQA-1.0 for MC VQA, and outperforms bilinear methods on the VQA-2.0 (Goyal et al., 2017), TDIUC (Kafle and Kanan, 2017a), and GQA (Hudson and Manning, 2019) datasets for FFOE VQA. Moreover, we take advantage of a pre-training task on our model, improving multi-modality understanding.

Related Work
Visual question answering (VQA) task. Since Antol et al. (2015) defined the VQA task (i.e., obtaining answers from a given image-question pair), it has received significant attention from the entire artificial intelligence community (Wu et al., 2017). There are two major types of VQA tasks: Multiple Choice (MC) VQA and Free-Form Open-Ended (FFOE) VQA (Do et al., 2019). In MC VQA (Zhu et al., 2016; Kafle and Kanan, 2017b), the answer is chosen from a candidate answer list for a given image-question pair, accessible in both training and test scenarios. FFOE VQA is more complicated, since the answers are only available in the training phase and there is no candidate answer list to choose from. However, FFOE VQA is the most common VQA task, and almost all models are aimed at this problem. The general solution is to extract the visual features and linguistic features first and then fuse them with a multi-modality fusion model, followed by a classifier or a generator to obtain the answer (Wu et al., 2017). Among these steps, exploring different fusion approaches is the mainstream research direction. On the one hand, the interactive relationships between the query image and the question have been fully modeled, e.g., with element-wise operations (Antol et al., 2015) and bilinear methods (Fukui et al., 2016; Kim et al., 2018; Ben-Younes et al., 2017). On the other hand, some works have improved VQA performance by considering the answer information (Hu et al., 2018; Do et al., 2019). For example, Jabri et al. (2016) combine the three input representations through a simple Multilayer Perceptron (MLP), and a subsequent work introduces a layered fusion operation that merges the image-question bilinear embeddings and the image-answer bilinear embeddings in a joint embedding space. To solve VQA in a targeted manner, we make full use of the answer information and propose a two-stage workflow, which converts FFOE VQA to MC VQA.
Attention-based networks. Inspired by the natural attention mechanism of humans, earlier works introduce the attention mechanism to VQA and achieve success.
For bilinear feature fusion, several attention mechanisms have been proposed, such as co-attention (Lu et al., 2016) and dual attention (Nam et al., 2017). In terms of trilinear feature fusion, the attention map for trilinear inputs is computed by PARALIND decomposition (Do et al., 2019); however, the output is only a joint vector for classification. In order to enhance each single-modality representation by fusing the other modalities, we propose Trilinear Interaction Attention (TrI-Att). Moreover, although self-attention cannot fuse different modalities, it can enhance the interaction within each modality (Yu et al., 2019). Therefore, we design a Self-Attention (Self-Att) unit for capturing the intra-modality information.
Multimodal contextual representations. Transformer-based models achieve good performance on vision-language tasks. These models normally employ multi-layer transformers to learn multimodal contextual representations, with two basic architecture types: single-stream and two-stream. Single-stream models concatenate image and language features first and then obtain cross-modality representations with a single multi-layer transformer, such as VL-BERT (Su et al., 2020) and UNITER (Chen et al., 2020). Two-stream models take advantage of self-attention transformers to encode language and image features respectively, and then build joint representations with cross-attention transformers, such as LXMERT (Tan and Bansal, 2019) and ViLBERT (Lu et al., 2019). To better align vision-language semantics, some works pre-train transformer-based structures on a large corpus of image-text pairs; the pre-training tasks usually include masked language prediction, RoI-feature regression, detected-label classification, and cross-modality matching (Tan and Bansal, 2019). In this paper, our proposed trilinear transformers deal with three input embeddings, different from former transformer-based methods.

MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers
As shown in Figure 3, our model takes three modality forms of input (i.e., images, questions, and answers). The backbone of MIRTT is two transformers with multiple layers, based on the TrI-Att and Self-Att mechanisms. Finally, in the output layer, we adopt an MLP for the specific downstream tasks.

Single-modality Embedding Extraction
Image embeddings. The image embeddings are extracted from a Faster R-CNN model (Anderson et al., 2018), a regional visual feature extractor. Specifically, for each object it extracts a vector with d_v dimensions; therefore, an image with v objects is represented as an embedding matrix V ∈ R^{v×d_v}.
Question and answer embeddings. We fine-tune BERT (Devlin et al., 2019) as our text extractor in the experiments. Specifically, the text is first converted to WordPiece embeddings (Wu et al., 2016). Then, through fine-tuning, each embedding is projected into R^{d_q} or R^{d_a}, for questions and answers, respectively. Finally, a question with a maximum length of q is represented as Q ∈ R^{q×d_q}, and similarly an answer as A ∈ R^{a×d_a}.
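As a toy illustration of the shapes involved, the following sketch uses random features in place of the real Faster R-CNN and BERT outputs; the dimensions follow the implementation details later in the paper, and the linear reduction of d_v to 768 mirrors the projection described there:

```python
import numpy as np

rng = np.random.default_rng(0)

v, q, a = 50, 12, 6              # max detected objects, question tokens, answer tokens
d_v, d_q, d_a = 2048, 768, 768   # per-modality feature dimensions

V = rng.standard_normal((v, d_v))  # image: one vector per detected object
Q = rng.standard_normal((q, d_q))  # question: one vector per WordPiece token
A = rng.standard_normal((a, d_a))  # answer: one vector per WordPiece token

# d_v is reduced to 768 with a learned linear projection before fusion.
W_proj = rng.standard_normal((d_v, 768)) / np.sqrt(d_v)
V_proj = V @ W_proj
print(V_proj.shape, Q.shape, A.shape)  # (50, 768) (12, 768) (6, 768)
```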

Trilinear Transformers
Figure 2: Trilinear interaction attention.
TrI-Att for inter-modality representations. For better cross-modality information fusion, we design TrI-Att to project each single-modality embedding into an inter-modality enhanced space (Figure 2). Following Section 3.1, let S = {V, Q, A} be the multimodal input collection. Firstly, we introduce the attention map M ∈ R^{v×q×a}, which is mainly computed by matrix multiplication and sum-based dimension reduction over the projected inputs:

M_{i,j,k} = Σ_d (V W_V)_{i,d} (Q W_Q)_{j,d} (A W_A)_{k,d}.  (1)

Taking the image representation V as an example (questions and answers are handled in the same way), each object position attends over all question-answer position pairs through M, and the attended features are fused back into V with a fusion operation similar to Eq. 1. We further utilize multi-head attention (Vaswani et al., 2017) to improve robustness by introducing a linear mapping for each single-modality representation in each head, so that the complete inter-modality fusion is

f(V) = ||_{i=1}^{N_h} Att(V W^i_V, Q W^i_Q, A W^i_A),

where W^i_V, W^i_Q, and W^i_A are the multi-head linear mappings, shared across the three forms of representations, N_h is the number of heads, and || indicates the concatenation of all heads. The fusion representations f(Q) and f(A) of the questions and answers are obtained analogously. After that, a fully connected feed-forward network with a residual connection follows.
Self-Att for intra-modality representations. We apply the encoder of the Transformer (Vaswani et al., 2017) to capture the intra-modality relationships.
We deploy a multi-head self-attention mechanism, followed by a feed-forward network with a residual connection. With input feature X ∈ R^{n×d}, the multi-head self-attention works as

head_i = softmax((X W^i_Q)(X W^i_K)^T / √d_h)(X W^i_V),  MHSA(X) = ||_{i=1}^{N_h} head_i,

where each W^i is the projection matrix for a certain modality M in the i-th head. This structure can enhance the long-distance dependencies among the multi-modality features while weakening their negative impact on the result to a certain degree.
Trilinear transformer stacks. In total, the trilinear transformer stacks N_L layers, where each layer combines two transformer modules. The multiple-modality transformer has a trilinear interaction attention module and a fully connected feed-forward (FF) network, and the single-modality transformer has three self-attention modules, following the structure of the encoder in the Transformer (Vaswani et al., 2017). Our essential motivation is to take advantage of the answer information, so a trilinear module is deployed first to fuse the information of the three modalities. However, this loses some information within each modality, so a single-modality transformer follows to reinforce the information of each one. For the MC VQA task, we feed the pooled answer representations of the final layer into a binary classifier and pick the answer with the highest binary score.
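To make the flow of one layer concrete, here is a minimal NumPy sketch of a single-head trilinear attention step. It assumes the attention map is the sum-reduced triple product of the projected inputs and fuses the attended question/answer features into the image side by addition; the function `tri_att_image` and these exact fusion choices are our illustrative reading, not the paper's definitive implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tri_att_image(V, Q, A, d_h=64, rng=np.random.default_rng(0)):
    """One head of a trilinear attention sketch: build the v x q x a map M by
    matrix multiplication and sum-based reduction, then enhance V by
    attending over all question-answer position pairs."""
    d = V.shape[1]
    W_V, W_Q, W_A = (rng.standard_normal((d, d_h)) / np.sqrt(d) for _ in range(3))
    Vh, Qh, Ah = V @ W_V, Q @ W_Q, A @ W_A
    # M[i, j, k] = sum_d Vh[i, d] * Qh[j, d] * Ah[k, d]
    M = np.einsum('id,jd,kd->ijk', Vh, Qh, Ah)
    v, q, a = M.shape
    att = softmax(M.reshape(v, q * a), axis=-1).reshape(v, q, a)
    # Fuse attended question and answer context into each object position.
    ctx = np.einsum('ijk,jd->id', att, Qh) + np.einsum('ijk,kd->id', att, Ah)
    return ctx  # (v, d_h): inter-modality enhanced image representation

V, Q, A = (np.random.default_rng(1).standard_normal((n, 768)) for n in (50, 12, 6))
out = tri_att_image(V, Q, A)
print(out.shape)  # (50, 64)
```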

Two-stage Workflow
In FFOE VQA, previous models usually do not take the answer as input, keeping the same input dimensions in the training and test phases, because the answer is not available in the test set. Knowledge distillation is one solution: a trilinear method teaches a bilinear model in the training phase, and the bilinear model is evaluated in the test phase. However, the answer information is still inaccessible in the test set. Therefore, as shown in Figure 4, we introduce a two-stage workflow to make full use of the dataset. Our proposal is a universal simplification process for the FFOE VQA task, which gives full play to the advantages of both the bilinear and trilinear models.

Figure 4: The overview of the two-stage workflow.
In the first stage, we train a bilinear model on the questions and images, and then the top four candidate answers for each question are provided based on the output logits. Since the bilinear model achieves very high accuracy on the training set, the candidate answers basically contain the correct answer. In the test phase, the trilinear model is fully dependent on the candidates from the bilinear model.
In stage two, the candidates are first restructured into several image-question-answer pairs by reusing the input image and question; therefore, the number of pairs is equal to the number of candidates. Then the trilinear model utilizes the image-question-answer pairs to choose a confident answer under the MC VQA task setting (illustrated in Section 3.2), where the answer-question and answer-image alignment information is learned.
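The two stages can be sketched as follows; `two_stage_predict`, the toy vocabulary, and the stand-in scoring function are hypothetical, illustrating only the candidate-generation and restructuring logic:

```python
import numpy as np

def two_stage_predict(bilinear_logits, answer_vocab, trilinear_score,
                      image, question, k=4):
    """Stage one: a bilinear model over (image, question) proposes the top-k
    candidate answers. Stage two: each candidate is paired with the same
    image and question, and the trilinear model scores the triples."""
    top_idx = np.argsort(bilinear_logits)[::-1][:k]
    candidates = [answer_vocab[i] for i in top_idx]
    # Restructure into image-question-answer triples (MC VQA setting).
    triples = [(image, question, ans) for ans in candidates]
    scores = [trilinear_score(*t) for t in triples]
    return candidates[int(np.argmax(scores))]

# Toy illustration with a hypothetical trilinear scoring function.
vocab = ["red", "blue", "gold", "green", "two"]
logits = np.array([3.1, 1.2, 2.8, 0.5, -1.0])
score = lambda img, q, ans: {"red": 0.9, "gold": 0.4}.get(ans, 0.1)
print(two_stage_predict(logits, vocab, score, image=None, question="color?"))  # red
```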

The Pre-training Strategy
In the hope of initializing our model effectively, we pre-train our model with the masked language modeling task, in a way similar to BERT (Devlin et al., 2019). Since our model is trilinear, the pre-training data format is a triple of the question, the image, and the correct answer. We utilize the training sets of the Visual7W, VQA-2.0, and TDIUC datasets to pre-train MIRTT.

MC VQA Tasks

Dataset. Visual7W is a subset of Visual Genome (Krishna et al., 2017). For each question-image pair, there are four candidate answers, of which only one is correct. There are two tasks for Visual7W, pointing and telling, and we conduct our method on the telling task. VQA-1.0 MC (Antol et al., 2015) is similar to Visual7W, but there are 18 candidate answers for each question.
Metrics. Each question has only one correct answer, so accuracy (Acc-MC) is used to measure performance (Zhu et al., 2016; Antol et al., 2015). We evaluate our methods on the "test" split of Visual7W and the "test-std" split of VQA-1.0 MC.

FFOE VQA Tasks
Dataset. VQA-2.0 is built from the MSCOCO dataset (Lin et al., 2014) and minimizes answer biases so that a language-only "blind" model cannot guess the right answers. TDIUC is a large VQA dataset of real images, with over 1.6M questions in 12 categories. GQA consists of 22M questions, and each image corresponds to a scene graph; the questions focus on visual reasoning and compositional question answering.
Metrics. In VQA-2.0, each question has ten human-generated answers. To account for inter-human variability, the accuracy-based evaluation metric (ACC) is defined as follows (Wu et al., 2017):

ACC = min(n / 3, 1),

where n is the frequency of the model's answer in the answer set of the corresponding image-question pair. In TDIUC and GQA, there is only one right answer for each question, so normal accuracy is used. In detail, we evaluate our methods on the "test-dev" split of VQA-2.0, the validation split of TDIUC, and the "test-std" split of GQA.
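A minimal sketch of this soft accuracy (assuming the common min(n/3, 1) form of the VQA metric, under which an answer given by at least three of the ten annotators scores full credit):

```python
def vqa_accuracy(n):
    """VQA-2.0 soft accuracy: an answer matching n of the ten
    human-generated answers scores min(n/3, 1)."""
    return min(n / 3.0, 1.0)

print(vqa_accuracy(1))  # 0.333... (one annotator agrees: partial credit)
print(vqa_accuracy(3))  # 1.0     (three or more annotators agree: full credit)
```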

Implementation Details
Except for the referenced models and special instructions, we fine-tune BERT as our text extractor for questions and answers, and we freeze the Faster R-CNN detector (Anderson et al., 2018) as the image extractor without fine-tuning. For images, the maximum number of detected bounding boxes is set to 50. For texts, the questions and answers are trimmed to maximum lengths of 12 tokens and 6 tokens, respectively.
³ The result is not the same as in the cited paper. Regrettably, after a lot of experiments, we still cannot reach the accuracy reported in the cited papers. Under a fair experimental environment, the ensemble result outperforms the bilinear result.
The hyper-parameters of MIRTT follow the defaults unless otherwise noted. The dimensions of the input images (d_v), questions (d_q), and answers (d_a) are 2048, 768, and 768, respectively. To simplify the calculation, we reduce d_v to 768 with a linear projection. For TrI-Att and Self-Att, the number of heads is 12, and the hidden dimension d_h is 64.
In all experiments with the two-stage workflow, we utilize a six-layer MIRTT pre-trained on collection 2 (Table 3). Furthermore, our code will be made publicly available with instructions at https://github.com/IIGROUP/MIRTT. More experimental settings can be found in the Appendix.

MIRTT Performance on MC VQA
As shown in Table 1, we compare our methods with previous methods on Visual7W telling task and VQA-1.0 multiple-choice task.
MCB (Fukui et al., 2016): a method that considers FFT space to combine multimodal features.
CTI (Do et al., 2019): a method that learns high-level associations between the three inputs by using a multimodal-tensor-based decomposition.
Dual-MFA (Lu et al., 2018): a framework that fuses input embedding by selecting the free-form image regions and detection boxes most related to the input question.
MFH: a framework that models both the image attention and the question attention simultaneously.
Our MIRTT with fine-tuning (Table 3) improves the CTI ACC-MC by 2.1% and improves the MFH ACC-MC by 3.6%.

MIRTT Performance on FFOE VQA
To evaluate the effectiveness of the two-stage workflow (Figure 4), we apply several bilinear methods as backbones in stage one and set our MIRTT as the trilinear method in stage two. In detail, the candidate answer lists are generated by the baselines; each contains four answers.
SAN: the Stacked Attention Network utilizes multiple attention layers, querying an image multiple times to infer the answer.
BAN2 (Kim et al., 2018): Bilinear Attention Network fuses the question embeddings and image embeddings by utilizing co-attention.
MLP: in this method, we use the first output token embedding as the global representation of a question. Then, we sum up all object embeddings of an image after multiplying each by a learned weight. The global representations of the question and the image are then added and fed into an MLP layer for classification.
ViLBERT (Lu et al., 2019) builds intra- and inter-relationships between vision and language based on a pre-trained transformer structure.
Ensemble results: the predictions are calculated by considering the outputs of stage one and stage two. The ensemble method normalizes the two results separately and adds them together. The final prediction is the candidate answer with the highest probability.
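A sketch of this ensemble rule; softmax is our assumed choice of per-stage normalization, which the text leaves unspecified:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ensemble_predict(bilinear_scores, trilinear_scores, candidates):
    """Normalize the stage-one and stage-two scores over the same candidate
    list, add them, and pick the highest-probability candidate."""
    p = softmax(np.asarray(bilinear_scores)) + softmax(np.asarray(trilinear_scores))
    return candidates[int(np.argmax(p))]

# Toy scores: the bilinear model slightly prefers "red", but the trilinear
# model's confidence in "gold" tips the combined prediction.
cands = ["red", "gold", "blue", "green"]
print(ensemble_predict([3.1, 2.8, 1.2, 0.5], [0.2, 1.5, 0.1, 0.0], cands))  # gold
```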
As shown in Table 2, the trilinear results and the ensemble results outperform the bilinear results. Our two-stage workflow solves the problem that trilinear models otherwise cannot be deployed in FFOE VQA. Furthermore, the ensemble results show that bilinear models can utilize the answers after the answer information has been modeled by the trilinear models.
Notably, the GQA dataset is not included in the pre-training data, yet our two-stage workflow and MIRTT still outperform the baseline methods, which shows the generalization capability of our approach.

The Components of MIRTT
Stacking layers and the size of pre-training data. As shown in Table 3, MIRTT only needs two layers to significantly outperform the others in "Random" based on accuracy.
Random: MIRTT is trained on Visual7W without pre-training.  Collection 1: MIRTT is pre-trained on the train sets of Visual7W and VQA-2.0.
After pre-training, MIRTT outperforms the non-pre-trained model at every depth (random vs. collection 1). As the number of layers increases, the accuracy of MIRTT with collection 1 improves; however, its capability seems to reach a limit at six layers, and growth hits a bottleneck.
Therefore, we add one more dataset to pre-train MIRTT. Compared with collection 1, MIRTT with collection 2 breaks the previously mentioned bottleneck and reaches the best score at the highest layer with more pre-training data. Perhaps similarly to ViT (Dosovitskiy et al., 2021), these attention-based deep models are sensitive to dataset size; therefore, the pre-trained MIRTT benefits from a larger number of parameters and more data, achieving an accuracy of 74.4%. Moreover, we conduct randomized Tukey HSD tests (p-values) and compute effect sizes based on one-way ANOVA (Sakai, 2018) to support the statistical significance of our results; details are in the Appendix.
Attention mechanisms. Since CTI does not consider the intra-modality information, we attempt to build structures to enhance it. In the "CTI + Self-Att" setting, the original output of CTI is a joint representation; we fuse it by adding the text embeddings to the joint representation, and then apply two layers of Transformer encoders (Vaswani et al., 2017). As shown in Table 4, after adding self-attention to obtain fine-grained information within each modality, CTI is improved by 0.5% compared with the original model.
BERT*: We fine-tune BERT on the input questions and answers and fuse the extracted image embeddings. In detail, we utilize the same operation as Bottom-Up and Top-Down (BUTD) (Anderson et al., 2018) to fuse all representations.

Method                                  Acc-MC
CTI                                     72.3
CTI + Self-Att                          72.8
BERT*                                   65.4
BERT + TrI-Att                          70.5
BERT + TrI-Att + Self-Att (MIRTT)       70.9

To discuss the two key components ("TrI-Att" and "Self-Att"), we utilize two layers of MIRTT without pre-training as our basic structure. By replacing simple fusion methods such as addition with TrI-Att, which considers the inter-modality information, the accuracy improves by 5.1%. Considering that CTI benefits from the self-attention mechanism, we also implement Self-Att in our trilinear transformers; the relative 0.4% improvement shows that MIRTT can learn the intra-modality information like "CTI + Self-Att".
Visualization for TrI-Att. Figure 5 visualizes the behavior of MIRTT by showing detailed attention values of TrI-Att for the question "What is the main color of the train?". The detected objects are presented with their numerical labels, and the special tokens in questions and answers are provided by BERT (Devlin et al., 2019). For the image-question-(answer "Red") pair, the correlation between object "5" (the train) and the token "Red" has a large attention value. Moreover, the model focuses on the triple "5"-"train"-"Red", which helps in reasoning that the train in the image is red. In terms of the answer "Gold", the locomotive (object "2") gains more attention than object "2" does for "Red". Therefore, the answers can assist MIRTT in predicting the correct choices. Figure 6 describes some examples of applying our two-stage workflow (Figure 4). In detail, the text extractors are all GRUs, and the trilinear method is MIRTT. The results show that our trilinear method is able to retrieve the most proper answer by utilizing the abundant information in the answers.
Whether the problem requires stronger reasoning skills, as in (a), or the ability to find correspondences among images, questions, and answers, as in (b) and (c), MIRTT can handle it with the two-stage workflow. With different bilinear methods as backbones, the trilinear method may predict different answers, as in (d).

Conclusions
We introduced a trilinear interaction framework called MIRTT, which captures the inter-modality and intra-modality information of images, questions, and answers. Our method is based on the TrI-Att and Self-Att mechanisms. The pre-trained model shows its effectiveness against the baselines on several datasets. Meanwhile, a two-stage workflow is introduced to apply trilinear methods to FFOE VQA, showing improvements on VQA-2.0, TDIUC, and GQA. We achieve state-of-the-art results on Visual7W and VQA-1.0 MC. Generally, with rich experimental comparisons and extensive discussion, we demonstrate the value of the answer information and provide a solution for VQA tasks.

Figure 6: A collection of image-question-answer pairs randomly selected from VQA-2.0 (Goyal et al., 2017), comparing the use of the two-stage workflow and different bilinear methods at stage one in the test phase.

A Pre-training Dataset

The amount of data in the pre-training datasets is shown in Table 5. Pre-training dataset collection 1 includes Visual7W (train/val) and VQA-2.0 (train/val). In VQA-2.0, there is a list of human-generated answers for each question.
We treat the answer with the highest score as the correct answer. The size of the image-question-answer pre-training collection is about 756k. In collection 2, we add the train set of TDIUC to pre-train MIRTT, and the size of the pre-training tuples grows to 1.87M.

B Implementation details
Full details will be presented in our open-source code on GitHub. The focal loss (Lin et al., 2017) is used for training the proposed models.
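For reference, here is a minimal sketch of the binary focal loss; this is our own illustrative implementation of Lin et al. (2017), and the α and γ defaults are the common values from that paper, not necessarily the ones used here:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy, well-classified examples.
    p is the predicted probability of the positive class, y is 0 or 1."""
    p_t = p if y == 1 else 1 - p          # probability assigned to the true class
    a_t = alpha if y == 1 else 1 - alpha  # class-balancing weight
    return -a_t * (1 - p_t) ** gamma * math.log(p_t)

print(focal_loss(0.9, 1))  # easy positive -> tiny loss (~0.00026)
print(focal_loss(0.1, 1))  # hard positive -> large loss (~0.466)
```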
Pre-training. When we pre-train MIRTT, the batch size is 128 and the initial learning rate is 1e-4. We use the model of epoch 7 for later fine-tuning.
Visual7W. The batch size is set to 32 for all models. For random initialization, the learning rate is 1e-4 and the number of epochs is 17; for fine-tuning, the learning rate is 3e-5 and the number of epochs is 11.
VQA-1.0. The initial learning rate is set to 1e-4, and the batch size is 16. Since each question has 18 choices, there are 288 samples in one batch.
Two-stage workflow. The hyper-parameter settings of the two-stage workflow are presented in Table 6. Stage one and stage two are separate, not an end-to-end structure. In stage one, the bilinear model is trained with a cross-entropy loss to produce logits over the answers; we then take the top four candidates based on the logits as the generated answer list. In stage two, we encode the answers and feed the embeddings of the three modalities into MIRTT to obtain representations. We put the pooled answer representations into a binary classifier and apply a binary cross-entropy loss with labels generated from the FFOE dataset.
The number of candidate answers. To make our proposal a universal framework for both FFOE VQA and MC VQA tasks, we set the number of candidates to four, following Visual7W in VQA and RACE (Lai et al., 2017) in QA. We will conduct related explorations based on this in the future. There are a few interesting problems: for example, if the candidate answers do not include the correct answer, the trilinear model cannot answer that question correctly. This problem is always possible unless the candidate list includes all answers, which is impractical. A limited extension of the candidate list could improve the coverage of correct answers, but contradicts our design's universality.
C P-values based on Randomized Tukey HSD Tests

Table 7, Table 8, and Table 9 show the statistical significance test results for the runs in Table 3. The names of the runs follow the rule name = D_L, where D ∈ {Rand, Col1, Col2} denotes the pre-training data collection and L ∈ {1, 2, 4, 6, 8} is the number of layers used in MIRTT. For example, Col2_2 stands for the two-layer MIRTT with collection 2.