Transformer-Exclusive Cross-Modal Representation for Vision and Language

Ever since the advent of deep learning, cross-modal representation learning has been dominated by approaches that employ convolutional neural networks for visual representation and recurrent neural networks for language representation. The transformer architecture, however, has rapidly overtaken recurrent neural networks in natural language processing tasks, and it has also been shown that vision tasks can be handled with the transformer architecture, with performance comparable to convolutional neural networks. Such results naturally lead to speculation about the possibility of tackling cross-modal representation for vision and language exclusively with transformers. This paper examines transformer-exclusive cross-modal representation to explore this possibility, demonstrating its potential as well as discussing its current limitations and prospects.


Introduction
While early cross-modal models handled visuolinguistic tasks with template-based methods (Barbu et al., 2012; Elliott and Keller, 2013) or as retrieval models (Farhadi et al., 2010; Ordonez et al., 2011), the advent of deep learning introduced end-to-end learning models for cross-modal tasks, in which convolutional neural networks (CNNs) (Krizhevsky et al., 2012) are employed for vision representation, whereas recurrent neural networks (RNNs), such as LSTM (Hochreiter and Schmidhuber, 1997) or GRU (Cho et al., 2014), are employed for language representation. While a plethora of variations exists, most models proposed in the past few years have invariably relied on this CNN-RNN approach.
Such a standardized scheme, however, started to change with the introduction of the transformer architecture based on the multi-head attention mechanism (Vaswani et al., 2017), which rapidly achieved state-of-the-art performance in natural language processing (NLP) (Peters et al., 2018) and speech recognition (Dong et al., 2018; Wang et al., 2020b), frequently outperforming RNNs. Furthermore, large-scale models based on transformer architecture, such as BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020), started to appear, demonstrating that pre-training a sufficiently large model on a very large amount of data results in strong performance and versatility across various downstream tasks.
The success of transformer-based models in NLP and speech recognition naturally led to their adaptation to cross-modal tasks. Lu et al. (2019) proposed ViLBERT, a pioneering BERT-inspired work that tokenizes images for compatibility with transformer architecture, and extends the pre-training objectives of BERT to reflect the nature of cross-modality. Many other cross-modal models followed, mostly with similar approaches to image tokenization and pre-training objectives. This line of transformer-based cross-modal work, however, still relied heavily on CNN-based models, such as Faster R-CNN (Ren et al., 2015), to extract features from images, and the application of transformer was mostly limited to language representation. Inspired by the observations of recent works (Dosovitskiy et al., 2020), which demonstrate that vision tasks can be handled solely by transformer architecture with performance comparable to CNN-based models, this paper examines cross-modal representation for visuolinguistic tasks relying exclusively on transformer architecture, without using CNNs or RNNs. Without any structural modifications or an advanced common embedding scheme, and without additional cross-modal pre-training that can be expensive both computationally and data-wise, our model demonstrates performance comparable to conventional approaches based on CNNs and RNNs on exemplary cross-modal tasks.

Related Works
ViLBERT (Lu et al., 2019) was one of the first models to extend transformer architecture to cross-modal visuolinguistic tasks. It proposes a co-attentional transformer, in which separate transformer modules for each modality run in parallel, with the key and value inputs from one modality entering the transformer block of the other modality, thereby learning cross-modal dependence. To tokenize the image, they extract image regions using Faster R-CNN (Ren et al., 2015) along with a 5-dimensional location vector. They also extend the two unique pre-training objectives of BERT, namely masked language modeling and next-sentence prediction, to the cross-modal setting, as masked multi-modal learning and image-sentence alignment classification. In masked multi-modal learning, visual tokens, along with language tokens, are randomly masked, and the model is trained to predict their probability distribution over object classes. In image-sentence alignment classification, a sequence of visual tokens and a sentence are juxtaposed, and the model performs a binary classification task, predicting whether the sentence describes the contents of the image. Many other models, such as VisualBERT (Li et al., 2019), LXMERT (Tan and Bansal, 2019), and Unicoder-VL, follow nearly identical pre-training objectives to ViLBERT. On the other hand, UNITER (Chen et al., 2020) demonstrates improved performance by introducing the additional pre-training objective of word-region alignment, while MiniVLM (Wang et al., 2020a) achieves comparable performance with up to 70% fewer parameters by utilizing EfficientNet (Tan and Le, 2019) with their own compact BERT model. While all models described above rely on CNN-based models to extract features from images, limiting the scope of applicability of transformer, recent works have demonstrated results that may imply a potential change in this workflow.
Dosovitskiy et al. (2020) proposed the Vision Transformer (ViT), demonstrating that a pure transformer architecture without convolution can achieve comparable performance on image classification tasks, while requiring a substantially lower computational cost. Furthermore, Touvron et al. (2021) showed via data-efficient image transformers (DeiT) that competitive performance can be achieved by training only on ImageNet (Deng et al., 2009) with no external data.

Model
We employ separate transformer models for vision and language, although their internal mechanisms are essentially identical. Following Dosovitskiy et al. (2020), we split an image into N patches x_p of P × P pixels, each of which is linearly projected into a D-dimensional patch embedding, where P = 16 and D = 768. A learnable embedding x_class is prepended to the patch embeddings, and positional embeddings are added. The input sequence z_0 subsequently undergoes alternating layers of layer normalization (Ba et al., 2016) and multi-head self-attention, followed by a 2-layer MLP with GELU (Hendrycks and Gimpel, 2020) as the non-linearity; the image representation y_img is taken as the layer-normalized class token of the final layer L. For language representation, we employ an off-the-shelf BERT model. An input sequence s_0 = [w_0, ..., w_S] is given with special tokens [CLS] and [SEP] inserted at the beginning and the end of the sequence, respectively. In the case of two sentences within the input sequence, a [SEP] token is also inserted between the two. Each token is represented as the sum of its word embedding, position embedding, and segment embedding, and undergoes bidirectional multi-head self-attention over multiple layers. The representation for the input sequence is obtained as h_0, ..., h_S from the uppermost layer, with the final language representation taken as the [CLS] token at the final layer L: y_lang = LN(s_L^0) (10). We now project the image and language representations into a common embedding space by concatenation: y = Concat(y_img, y_lang) (11). Note that we deliberately choose the most elementary common embedding scheme, as our focus is to examine the performance of the features themselves, rather than the embedding scheme. It is thus highly likely that coupling them with a more sophisticated embedding scheme would yield a significant performance boost. Fig. 1 describes the overall architecture of our approach.
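As a concrete illustration, the patch tokenization and concatenation steps above can be sketched in NumPy. The 224 × 224 image size, the random stand-in projection weights, and the placeholder language vector are our own illustrative assumptions; in the actual model the parameters are learned and the tokens pass through full transformer encoders before the class token is read off.

```python
import numpy as np

# Illustrative dimensions following the paper: P = 16, D = 768,
# and a 224x224 RGB image, giving N = (224/16)^2 = 196 patches.
P, D = 16, 768
image = np.random.rand(224, 224, 3)

# Split the image into N non-overlapping P x P patches and flatten each.
H, W, C = image.shape
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)            # (196, 768)

# Linear projection into D-dimensional patch embeddings
# (random stand-in weights; learned in the real model).
W_proj = np.random.randn(P * P * C, D) * 0.02
x_p = patches @ W_proj                              # (196, 768)

# Prepend the learnable class embedding and add positional embeddings.
x_class = np.zeros((1, D))
pos = np.random.randn(x_p.shape[0] + 1, D) * 0.02
z0 = np.concatenate([x_class, x_p], axis=0) + pos   # (197, 768)

# After L encoder layers, y_img would be the layer-normalized class
# token; here the un-encoded token stands in for it, and a random
# vector stands in for the BERT side, y_lang = LN(s_L^0).
y_img = z0[0]                                       # (768,)
y_lang = np.random.randn(D)
y = np.concatenate([y_img, y_lang])                 # (1536,) as in Eq. (11)
```

The 1536-dimensional vector y is what the classification head is trained on; no cross-attention or learned fusion is involved, which is precisely the "most elementary" scheme the paper intends.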

Setting
For images, we use a ViT-B model pre-trained on ImageNet-21k. The model contains 12 layers with 12 attention heads and a hidden size of 768, amounting to 86M parameters. For language, we use an off-the-shelf BERT-BASE model, trained with BERT's pre-training objectives of masked language modeling and next-sentence prediction on BookCorpus (Zhu et al., 2015) and English Wikipedia. Like ViT-B, the model contains 12 layers with 12 attention heads and a hidden size of 768, and consists of 110M parameters. During both training and testing, image and language features are extracted from the uppermost layer of the respective model and concatenated into a 1536-dimensional vector. The concatenated features are trained with cross-entropy loss and the Adam (Kingma and Ba, 2014) optimizer.
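The quoted parameter counts can be roughly sanity-checked from this configuration. The function below is our own back-of-the-envelope estimate (ignoring biases and layer-norm parameters), not the authors' accounting:

```python
def approx_transformer_params(layers=12, d=768, mlp_ratio=4,
                              patch=16, channels=3, n_patches=196):
    """Rough parameter count for a ViT-B-style encoder
    (biases and layer-norm parameters omitted for simplicity)."""
    attn = 4 * d * d                      # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d * d           # two MLP layers of width 4d
    embed = patch * patch * channels * d  # patch projection
    pos = (n_patches + 1) * d             # positional + class embeddings
    return layers * (attn + mlp) + embed + pos

print(f"{approx_transformer_params() / 1e6:.1f}M")  # close to the 86M figure
```

With the ViT-B defaults this lands near the 86M figure above; BERT-BASE additionally carries a token embedding table of roughly 30K vocabulary entries × 768 dimensions (~23M parameters), which accounts for most of the gap up to its 110M total.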
We evaluate our model on the following commonly tackled cross-modal visuolinguistic tasks: visual question answering (VQA) (Antol et al., 2015; Goyal et al., 2017), visual commonsense reasoning (VCR) (Zellers et al., 2019), and reasoning about natural language grounded in photographs (NLVR2) (Suhr et al., 2019). For VCR, we follow Lu et al. (2019) by forming 4 possible pairs of question and answer. For NLVR2, we follow the pair approach of Chen et al. (2020), embedding each image with the query, as it is reported to outperform the triplet approach of embedding two images with the query. We trained with 4 V100 GPUs with a batch size of 96 for VQA and NLVR2, and 48 for VCR, adjusted to the memory constraints of the computational environment. The learning rate was initially set to 1e-4 under a linearly decaying schedule with warm-up. We trained the model for 25 epochs for each task.
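The "linearly decaying schedule with warm-up" can be sketched as a simple step-dependent function. The warm-up length and total step count below are illustrative choices of ours; the paper does not specify them:

```python
def lr_at_step(step, total_steps, base_lr=1e-4, warmup_steps=1000):
    """Linear warm-up to base_lr, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = total_steps - warmup_steps
    return base_lr * max(0.0, (total_steps - step) / remaining)

# The peak learning rate is reached at the end of warm-up and then
# decays linearly to zero by the final step.
print(lr_at_step(1000, 10000))   # 1e-4, the paper's initial rate
```

In practice this is usually plugged into the optimizer as a per-step multiplier (e.g. a LambdaLR-style scheduler) rather than called by hand.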

Results
[Table 1: performance comparison with other transformer-based cross-modal models, including CAPT (Luo et al., 2020) and ERNIE-ViL; the 2nd column lists the number of image-caption pairs seen during cross-modal pre-training.]

Table 1 compares our model's performance with other transformer-based cross-modal models. While our model's performance falls below that of state-of-the-art models, it is noteworthy that other models explicitly perform additional cross-modal pre-training on top of already pre-trained vision and language modules. On the other hand, our model simply concatenates two pre-trained models without additional cross-modal pre-training, and is immediately trained on the target task. For example, ViLBERT and DeVLBERT (Zhang et al., 2020) are pre-trained with the Conceptual Captions dataset (Sharma et al., 2018); our model is at the disadvantage of not having seen its 3.3M image-caption pairs, yet comes fairly close to those pre-trained models. Fig. 2 shows qualitative examples of the model's performance on each task. In addition, while many papers on pre-trained cross-modal representations do not report the specific number of parameters, our approximations of other models' sizes, based on the implementation details reported in the respective papers, suggest that our model is reasonably smaller, especially since it completely eliminates the need for an external region detector.

Further Experiments
In order to examine how much each component contributes to performance, we conduct further experiments, replacing each component with conventional modules. We first replace the transformer encoder for images with a CNN module, specifically ResNet-50 (He et al., 2016) trained on ImageNet, using global average-pooled features. We also examine replacing the BERT module with an LSTM (Hochreiter and Schmidhuber, 1997) using early fusion with image features. For a fair comparison, we used concatenation as the common embedding scheme for all combinations. Table 2 shows the results. While ResNet/BERT comes fairly close, it falls below our model, and the performance drop is clearer with ViT/LSTM, possibly reflecting the superior adaptability of BERT compared to LSTM. Our conjecture is that architectural integrity, i.e., using the same architecture for both vision and language throughout the model, plays an important role in learning cross-modal representations. Note, however, that it would require a more thorough analytical study to conclusively claim that ViT is superior to ResNet, or that attention is superior to convolution; our primary purpose in this experiment is simply to demonstrate that transformer-exclusive models can achieve performance comparable to models employing CNNs. We also examined linearly training a classifier for the target task while keeping the extracted features fixed, without fine-tuning. As expected, there is a significant performance drop, reaffirming the premise that the competence of ViT and BERT is attained via fine-tuning on downstream tasks.
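The linear-probe setting just described, a classifier trained on frozen features, amounts to plain softmax regression. The sketch below uses random stand-ins for the frozen 1536-dimensional concatenated features and 4-way labels (e.g. a VCR-style 4-choice task); real features would come from the frozen ViT/BERT encoders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen concatenated features and 4-way labels.
X = rng.normal(size=(256, 1536))
y = rng.integers(0, 4, size=256)

W = np.zeros((1536, 4))  # the linear probe: the only trained parameters

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss(W):
    p = softmax(X @ W)
    return -np.log(p[np.arange(len(y)), y]).mean()

start = loss(W)
for _ in range(200):  # plain full-batch gradient descent
    p = softmax(X @ W)
    p[np.arange(len(y)), y] -= 1.0      # gradient of cross-entropy w.r.t. logits
    W -= 0.05 * X.T @ p / len(y)

print(f"loss: {start:.3f} -> {loss(W):.3f}")  # decreases as the probe fits
```

The probe can only exploit whatever structure is already in the frozen features, which is why its performance gap relative to full fine-tuning is informative.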
Note that, although we performed experiments on a small set of cross-modal tasks, given the superior performance of ViT over ResNet on image classification as reported by Dosovitskiy et al. (2020), and on other computer vision tasks as reported by models like the pyramid vision transformer (Wang et al., 2021), we believe any task involving vision and language is a potential beneficiary of the transformer-exclusive approach, since it enables architectural integrity across both modalities.

Conclusion
This paper proposed to handle cross-modal tasks for vision and language based solely on the transformer architecture, examining it on various cross-modal tasks. Our paper admittedly does not claim state-of-the-art performance, but to the best of our knowledge, our work is one of the first attempts, along with models like ViLT (Kim et al., 2021) and UniT (Hu and Singh, 2021), to examine cross-modal representation for vision and language based solely on transformer architecture, excluding CNNs and RNNs. Without any structural modifications or a sophisticated common embedding scheme, and without additional cross-modal pre-training with millions of samples, our model demonstrates performance comparable to state-of-the-art cross-modal models. Since we deliberately chose the smallest baseline models for each component, and a very simple concatenation scheme, we can intuitively expect enhanced performance by selecting larger pre-trained models at the cost of more parameters, or by selecting a more sophisticated common embedding scheme. The same holds true for the amount of pre-training data: we can reasonably expect a performance boost from using the same amount of pre-training data employed by previous models. Given transformer's relative computational efficiency as reported by Dosovitskiy et al. (2020), the architectural integrity proposed in our model is likely to lead to new research directions, and we hope to encourage more advanced models with novel ideas in the near future.