MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning



Introduction
Self-supervised representation learning with transformer models (Vaswani et al., 2017) has become the dominant technique in Natural Language Processing in recent years, with encoder transformer models trained using a Masked Language Modeling (MLM) objective (Devlin et al., 2019) excelling at Natural Language Understanding tasks, and autoregressive decoder models (Radford et al., 2018, 2019; Brown et al., 2020) displaying impressive Natural Language Generation at increasingly large scales. Vision Language (VL) modeling - the modeling of joint image-text representations for tasks such as image captioning or visual question answering (VQA) - has followed suit, with the transformer encoder becoming the prevalent architecture in recent research. A popular approach among the latest state of the art VL models is to use a BERT-style encoder language model (LM) in combination with an object detection backbone such as Faster-RCNN (Ren et al., 2015). This approach, while displaying impressive performance on challenging benchmarks, has a number of drawbacks (see Section 2), in particular not being able to solve VL tasks in an open-ended, generative fashion.
A recent line of work (Tsimpoukelli et al., 2021; Wang et al., 2021; Sollami and Jain, 2021) explores VL modeling using autoregressive decoder models trained with a language modeling objective. SimVLM (Wang et al., 2021) shows impressive performance, but requires prohibitively large amounts of pretraining data and the training of language and vision components in tandem. Frozen (Tsimpoukelli et al., 2021) shows that a pretrained autoregressive language model can, without any finetuning of the LM weights themselves, be harnessed to train a visual prefix which enables images to be used as its input. While the performance of Frozen on VL benchmarks falls short compared to the state of the art, we feel the approach is promising due to its practicality, and the public availability of large, pretrained LMs such as GPT-J (Wang and Komatsuzaki, 2021), PanGu-α (Zeng et al., 2021), and GPT-Neo (Black et al., 2021).
Extending the Frozen approach, in this paper we introduce a framework to combine existing unimodal language and unimodal vision models pretrained on large web datasets into a powerful multimodal model. Specifically, our contributions are: i) We introduce MAGMA: an autoregressive VL model that is able to generate text from an arbitrary combination of visual and textual input. Like Frozen, we start from a fixed large LM and a visual encoder-prefix stack. MAGMA differs from Frozen by additionally augmenting the LM with adapter layers, and by using CLIP's (Radford et al., 2021) visual component as encoder. Training only the adapters and visual components, the method is parameter efficient and naturally retains the LM's encyclopedic knowledge and in-context learning abilities. ii) Pretrained on a simple next token prediction objective, MAGMA is competitive in several VL downstream tasks, significantly outperforming its predecessor, Frozen, while pretraining on ∼0.2% of the number of samples used for SimVLM (Wang et al., 2021). In particular, MAGMA achieves state of the art accuracy on the OKVQA benchmark, which we evaluate as a fully open-ended generative task. iii) Our extensive ablations on the vision encoder and adapter components show i) that a pretrained CLIP ResNet encoder outperforms other visual backbones, ii) that an adapter-tuned model outperforms a visual prefix-only method, and iii) that different adapter configurations excel at different downstream tasks. iv) We show that a carefully curated pretraining dataset - including around 25 million image-text pairs from a wide range of sources, including downstream task training data - can dramatically increase downstream performance when compared to a noisier, web-scraped dataset (CC12M; Changpinyo et al., 2021a).
We only explore the VL domain in this work, but we expect the general method of a modality-specific prefix in combination with adapter layers and a frozen LM to apply equally well to other combinations of modalities, such as audio-text pairs. With this publication, we open-source our code and release a trained model checkpoint.1

Related Work
VL models of the past years (Zhang et al., 2021; Li et al., 2020; Chen et al., 2019; Li et al., 2019; Su et al., 2020; Tan and Bansal, 2019) harness a BERT-like encoder transformer as the language component, trained with an MLM objective - where random words in the input are masked out and the model is tasked with predicting them. Encoder VL models are often also pretrained with auxiliary objectives or custom cross-modal losses, such as the Masked Region Modeling, Image-Text Matching and Word-Region Alignment of UNITER (Chen et al., 2019), or the contrastive loss of OSCAR (Li et al., 2020). Using auxiliary cross-modal loss functions and pretraining tasks complicates the pretraining procedure by requiring these losses to be properly balanced. Additionally, encoder models need extra task-specific finetuning for each task to perform effectively, limiting their accessibility. In comparison, autoregressive VL models like MAGMA are trained on a single, simple next token prediction objective, and can perform well on a wide range of tasks without further finetuning.
Two predecessors to our method are Frozen (Tsimpoukelli et al., 2021) and SimVLM (Wang et al., 2021), two autoregressive decoder models trained with a next token prediction language modeling objective. Frozen affixes an NFResnet (Brock et al., 2021) vision encoder to a pretrained autoregressive LM and, keeping the LM weights frozen, trains the vision encoder along with a visual prefix that linearly maps the output of the vision encoder to the dimensionality of the LM's token embeddings. Frozen shows that autoregressive VL models have the ability to adapt to examples in-context, like their language-only counterparts (Brown et al., 2020), without performing any gradient updates. When shown multiple examples of a task in its context window in a few-shot setting, its performance on that task improves: it appears to 'learn' from the presented examples without task-specific finetuning. Our model has similar in-context learning capabilities, but the addition of adapters and the different choice of visual backbone results in a model with improved performance when trained on a comparable dataset, see Section 3.3.
SimVLM is similar to Frozen, but pretrains the vision and language components in tandem using a prefix LM objective. SimVLM consists of an encoder-decoder transformer with a combined ResNet (He et al., 2015) and ViT (Dosovitskiy et al., 2020) visual backbone.

Our work builds on recent advances in parameter-efficient finetuning of LMs (Houlsby et al., 2019; Li and Liang, 2021; Lester et al., 2021; He et al., 2021; Hu et al., 2021), specifically with adapter layers (Houlsby et al., 2019; He et al., 2021; Pfeiffer et al., 2020), which are small modules inserted in between the elements of a transformer layer and finetuned instead of the model weights as a form of parameter-efficient finetuning.
For a visual backbone, it is common to use region features from a pretrained object detection model such as Faster-RCNN (Ren et al., 2015). These are generally trained using expensive human-labeled data on a bounded set of object classes, limiting the number of object types the resulting model can recognize. On the other hand, contrastive models such as CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) present a more robust approach to learning visual features by learning joint representations between image-text pairs. They show strong performance on a wide variety of vision tasks as well as impressive generalization abilities that can provide powerful semantic guidance to image generation (Esser et al., 2021). But since they were only trained to match image-text pairs, they cannot inherently be used for tasks that require text generation as output (Shen et al., 2021).
However, Shen et al. (2021) show that the weights of contrastive language-image models contain useful semantic information for VL tasks. By replacing the conventional region-based backbone with CLIP's visual encoder in popular VL architectures, the authors achieve SOTA results across a wide variety of VL tasks without needing region-based features, motivating us to use CLIP's visual component as the vision encoder for MAGMA. Notably, we confirm their finding that the ViT variant of CLIP underperforms on VL tasks when compared to the ResNet variant, particularly in tasks that require localization within an image.

Method
Our general approach is an image-conditioned variant of soft-prompting or prefix tuning (Lester et al., 2021; Shin et al., 2020; Qin and Eisner, 2021) for language transformers and extends the Frozen method (Tsimpoukelli et al., 2021). The core idea is to translate image features into language embeddings carrying visual information, which can then be interpreted by the language transformer without the need to retrain the latter from scratch.

Architecture
The model can be broken down into four main components, see Figure 2. First, images are fed into a Visual Encoder, which processes the raw image input and outputs a sequence of feature vectors. An Image Prefix module then maps the image features into a sequence of embedding vectors that are input to the third model component, an auto-regressive Language Model. The fourth component is a series of Adapter layers which are inserted into the transformer LM and tuned during training. We discuss the four components in more detail below.
Visual Encoder - V_e. The visual encoder is a network used to extract condensed semantic information about an image. In principle, the visual encoder could take the form of any deep vision network whose output can be mapped to a sequence of embedding vectors. For our ablations, we use the visual backbone of several variants of CLIP. We also train a model with an NFResnet encoder trained from scratch, which is analogous to the model presented in Frozen, see §4.2. The visual encoder output is then passed into the Image Prefix.
Image Prefix - V_p. Before the encoder output can be input to the LM, it needs to be translated into a sequence of n d_h-dimensional vectors, where d_h is the LM's hidden dimension. For the CLIP encoders, we extract the feature grid before the pooling layers, resulting in an N × N grid, where N = 7, 7, 12 for the ViT-B/32, RN50x4 and RN50x16 variants of CLIP, respectively. We then flatten the feature grid into a sequence of N² vectors and linearly transform the vectors' channel dimension to d_h. For the NFResnet variant, we follow the procedure described in Frozen by linearly transforming the output to d_h · n, where n can be an arbitrary sequence length which we set to 2. Finally, we apply dropout regularization to the output of the image prefix, followed by Layer Normalization. We also explored non-linear variants of the prefix mapping, replacing the linear transformation with an MLP and a transformer encoder, but found no improvements.
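As a concrete illustration, the flatten-and-project mapping for the CLIP encoders can be sketched as follows. The channel count and hidden size below are illustrative assumptions (the actual values depend on the CLIP variant and the LM), and the dropout and Layer Normalization steps are omitted.

```python
import numpy as np

def image_prefix(feature_grid, W, b):
    """Map a CLIP feature grid of shape (C, N, N) to a sequence of N^2
    d_h-dimensional embeddings via a learned linear projection.
    W has shape (C, d_h), b has shape (d_h,)."""
    C, N, _ = feature_grid.shape
    seq = feature_grid.reshape(C, N * N).T   # (N^2, C): one vector per grid cell
    return seq @ W + b                       # (N^2, d_h)

# Illustrative sizes: a 12x12 grid as for RN50x16; C and d_h are assumed values.
rng = np.random.default_rng(0)
grid = rng.standard_normal((3072, 12, 12))
W = rng.standard_normal((3072, 4096)) * 0.01
b = np.zeros(4096)
prefix = image_prefix(grid, W, b)
print(prefix.shape)  # (144, 4096): 144 image tokens for the LM
```

Each of the 144 grid cells becomes one "image token" in the LM's embedding space, which is why the prefix length varies with the encoder's grid size.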
Language Model -E, T, H The language backbone of our architecture is initialized from a pretrained auto-regressive transformer LM similar to GPT (Radford et al., 2018).
A text input y is converted into a sequence of tokens t_1, ..., t_m. A word embedding layer E then maps each token t_k to a unique vector e_k = E(t_k) ∈ R^{d_h}, yielding a sequence of embeddings e_1, ..., e_m which are input to a transformer-decoder module T with a causal attention mask. A language model head H maps the final output embeddings of the transformer to logits over the token space, which can be used in a cross-entropy loss for a next-token-prediction training objective and to auto-regressively generate text during inference. Because any sequence of vectors v_1, ..., v_m ∈ R^{d_h} can be used as input to the transformer, we can use images as input after mapping them through the encoder and the prefix as described above.
For the LM component, we use the open-sourced weights of the 6 billion parameter GPT-J (Wang and Komatsuzaki, 2021) LM. Since its architecture is largely similar to that described in Radford et al. (2018), we will not cover it in this paper, but do note two key differences of GPT-J compared to the original GPT architecture. Firstly, GPT-J replaces learned positional embeddings with rotary positional embeddings (Su et al., 2021), a form of relative positional embedding. As noted in Tsimpoukelli et al. (2021), relative positional embeddings enable the transformer to generalize to inputs with more than one image, or a different image-text ordering compared to the training distribution, which is key to the VL model's ability to perform in-context learning with multiple image examples. Secondly, the attention layer and the feed-forward layer are computed in parallel for decreased communication costs (Wang and Komatsuzaki, 2021).
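For intuition, the rotary scheme can be sketched in a few lines. This is a simplified NumPy illustration of the relative-position property, not GPT-J's actual per-head implementation.

```python
import numpy as np

def rotary(x, positions, base=10000.0):
    """Rotary positional embedding (Su et al., 2021): rotate pairs of
    feature dimensions of x (shape (seq, d), d even) by angles that grow
    with position, so attention scores depend only on relative offsets."""
    seq, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    ang = positions[:, None] * freqs[None, :]   # (seq, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# The query-key dot product is invariant under shifting both positions,
# which is why the model tolerates reordered or multi-image inputs:
rng = np.random.default_rng(0)
q, k = rng.standard_normal((1, 8)), rng.standard_normal((1, 8))
score_a = rotary(q, np.array([3])) @ rotary(k, np.array([5])).T
score_b = rotary(q, np.array([10])) @ rotary(k, np.array([12])).T
assert np.allclose(score_a, score_b)  # only the offset (2) matters
```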
Adapters - {A_i}. Adapters are a series of small modules placed in between elements of a transformer model (Houlsby et al., 2019) that can be finetuned instead of the model weights as a form of parameter-efficient finetuning. We use the framework of He et al. (2021), where the adapter layers take the form of a scaled residual bottleneck MLP:

A_i(h) = h + λ_i · W_up^i φ(W_down^i h),    (1)

where W_down^i ∈ R^{d_b × d_h} and W_up^i ∈ R^{d_h × d_b} are the bottleneck projection matrices, φ is an activation function (in our case ReLU) and λ_i is a scaling parameter that is either trained or set equal to 1. We refer to the ratio d_h/d_b as the downsample factor of the adapter. Given a set of adapters {A_i} and a transformer module T, we denote the adapted version of T by T̃, meaning that the attention and/or feed-forward blocks B_i of T are replaced by their adapted versions B̃_i, obtained by inserting the adapters either sequentially or in parallel:

B̃_i(h) = A_i(B_i(h))  (sequential),    B̃_i(h) = B_i(h) + λ_i · W_up^i φ(W_down^i h)  (parallel).    (2)

We experiment with both parallel and sequential adapter variants, see Section 4.2.1 for results.
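A minimal sketch of such a bottleneck adapter (sequential form) follows. The sizes are illustrative, and the zero-initialized up-projection, which makes the adapter act as the identity at the start of training, is a common implementation choice rather than something specified above.

```python
import numpy as np

def adapter(h, W_down, W_up, lam=1.0):
    """Scaled residual bottleneck adapter: A(h) = h + lam * relu(h W_down) W_up,
    applied row-wise to a sequence of hidden states h of shape (seq, d_h).
    W_down: (d_h, d_b), W_up: (d_b, d_h)."""
    return h + lam * np.maximum(h @ W_down, 0.0) @ W_up

d_h, d_b = 4096, 1024                        # downsample factor d_h / d_b = 4
rng = np.random.default_rng(0)
h = rng.standard_normal((5, d_h))            # 5 hidden states from the block below
W_down = rng.standard_normal((d_h, d_b)) * 0.01
W_up = np.zeros((d_b, d_h))                  # zero init: adapter starts as identity
out = adapter(h, W_down, W_up)
assert out.shape == h.shape and np.allclose(out, h)
```

Only W_down, W_up and (optionally) lam are trained, which is what makes the approach parameter efficient relative to finetuning the full LM.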

Training
During training, the weights of the LM components E, T, H remain unchanged, whereas the weights of the image encoder V_e, the image prefix V_p and the adapters {A_i} are optimized. The language model components are initialized with weights from the pretrained GPT-J model, and the image encoder is initialized with pretrained CLIP weights, except for the NFResnet ablation, where the image encoder is randomly initialized. The image prefix and adapters are always trained from scratch. In the following we denote the trainable parameters of a module by the subscript θ. As described in 3.1, a set of trainable adapters {A_{i,θ}} gives rise to the modified transformer module T̃_θ.
The training objective is a captioning task: given an image-caption pair (x, y), we embed the image as v_{1,θ}, ..., v_{n,θ} = V_{p,θ} ∘ V_{e,θ}(x) and the text as e_1, ..., e_m = E(t_1), ..., E(t_m), where {t_k} is the tokenized caption y. Note that the image sequence length n is fixed while the length of the caption m is variable. The image embeddings are then prepended to the text embeddings and fed through the adapted transformer module. Denoting the embedding-to-logits function as l_θ = H ∘ T̃_θ, we then compute the loss

L_θ(x, y) = −(1/m) Σ_{i=1}^{m} l_θ(v_{1,θ}, ..., v_{n,θ}, e_1, ..., e_{i−1})_{t_i},    (3)

where l_θ(v_{1,θ}, ..., v_{n,θ}, e_1, ..., e_{i−1})_{t_i} is interpreted as the next-token log-probability conditioned on the previous sequence elements,

l_θ(v_{1,θ}, ..., v_{n,θ}, e_1, ..., e_{i−1})_{t_i} = log p_θ(t_i | x, t_1, ..., t_{i−1}).    (4)

For technical details regarding training, see Appendix A.
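In code, the per-sample loss can be sketched as follows, assuming the logits for the concatenated [image prefix; caption] sequence have already been computed; the vocabulary size and sequence lengths are toy values.

```python
import numpy as np

def caption_loss(logits, token_ids, n_prefix):
    """Next-token cross-entropy over caption tokens only.
    logits: (n + m, V) outputs of l_theta for the full sequence;
    token_ids: (m,) caption token ids. The logit at position
    n_prefix + i - 1 predicts token t_i; image positions carry no target."""
    z = logits - logits.max(axis=-1, keepdims=True)        # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    m = len(token_ids)
    idx = np.arange(n_prefix - 1, n_prefix - 1 + m)        # predict t_1 .. t_m
    return -logp[idx, token_ids].mean()

# Toy check: logits that put all mass on the correct tokens give ~zero loss.
n, m, V = 2, 3, 4
tokens = np.array([1, 2, 0])
logits = np.zeros((n + m, V))
for i, t in enumerate(tokens):
    logits[n - 1 + i, t] = 20.0
loss = caption_loss(logits, tokens, n_prefix=n)
assert 0.0 <= loss < 1e-6
```

Note that the gradient of this loss flows back through the image positions into the prefix and encoder even though they carry no prediction target themselves.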

Dataset
For pretraining we use two different large-scale datasets, one for the ablations and another for our final models MAGMA_base and MAGMA_long. For all ablations in 4.2 we train on CC12M (Changpinyo et al., 2021a) for a total of around 3M samples, ensuring comparability with Frozen. Unfortunately, CC12M performs hypernyming, replacing people's names with ⟨PERSON⟩. This causes downstream models to output ⟨PERSON⟩ overwhelmingly often, even when the inputs do not contain people or places. This failure mode, as well as recent research suggesting that increased training dataset diversity improves downstream generalization capabilities (Zhang et al., 2021; Radford et al., 2021; Brown et al., 2020; Gao et al., 2021), prompted us to construct another large-scale pretraining dataset from various publicly available image-text datasets, including a heavily filtered subset of LAION (Schuhmann et al., 2021), Wikipedia Image-Text (Srinivasan et al., 2021), CC3M (Changpinyo et al., 2021b), Visual Genome (Krishna et al., 2016), and Localized Narratives (Pont-Tuset et al., 2020).

Experiments and Analysis
To evaluate our methodology, we first train a series of ablations (cf. §4.2) to break down the effects of the vision encoder and adapter choice. We evaluate these ablations, and all subsequent models, on a range of visual question answering and image captioning tasks designed to quantify the model's ability to adapt to new tasks using in-context learning, recognize a wide variety of objects, and reason in detail about an image - often involving complex spatial understanding, encyclopedic world knowledge, and optical character recognition (OCR).

Visual Question Answering (VQA)
VQA tasks require the model to answer a question about the input image. Breaking from previous works, which generally formulate VQA as a classification task over the most frequent responses in the training set, we formulate all VQA tasks as open-ended generative tasks to enable few-shot prompting. We use the following datasets: VQA 2.0 (Antol et al., 2015), a large and commonly used dataset for VQA where samples consist of an image, a question regarding the content of the image and 10 corresponding ground-truth answers. OKVQA (Marino et al., 2019), a VQA dataset where correct answers require explicit outside world knowledge not contained in the picture. GQA (Hudson and Manning, 2019), a large VQA dataset focusing on visual and spatial reasoning. VizWiz (Gurari et al., 2018), a dataset in the same format as VQA with questions asked by visually impaired people; the ground truth to a question about an image may be "unanswerable" or "unsuitable", which has to be recognized by the model.
To compare the generated model output with the provided ground truths, we apply the normalization procedure of the official VQA 2.0 repo,2 and truncate the model output to the length of the longest ground-truth answer. For VQA, OKVQA, and VizWiz we calculate the accuracy metric from the official VQA paper (Antol et al., 2015), and for GQA we use the canonical accuracy score.
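For reference, a simplified version of that accuracy metric can be sketched as follows (the official implementation additionally averages over all 10-choose-9 annotator subsets and applies the answer normalization mentioned above, both omitted here):

```python
def vqa_accuracy(prediction, ground_truths):
    """Simplified VQA accuracy (Antol et al., 2015): an answer counts as
    fully correct if at least 3 of the (typically 10) annotators gave it,
    and receives partial credit otherwise."""
    matches = sum(gt == prediction for gt in ground_truths)
    return min(matches / 3.0, 1.0)

# Three or more matching annotators give full credit; fewer give partial credit.
assert vqa_accuracy("dog", ["dog"] * 4 + ["cat"] * 6) == 1.0
assert abs(vqa_accuracy("dog", ["dog"] * 2 + ["cat"] * 8) - 2 / 3) < 1e-9
assert vqa_accuracy("dog", ["cat"] * 10) == 0.0
```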
For few-shot settings, we use the procedure described in Tsimpoukelli et al. (2021), prepending n random examples of completed tasks before each question-answer pair. We prepend "Q: " and "A: " to each question and answer respectively, which improves performance (as exemplified in Figure 3).
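The few-shot prompt assembly can be sketched as follows. The `<image_k>` marker is a purely illustrative placeholder for the k-th image's embedded prefix, which in the actual model is a sequence of embedding vectors rather than a text token, and the exact separator formatting is an assumption.

```python
def build_prompt(examples, question, n_shots):
    """Assemble an n-shot VQA prompt: n completed (question, answer) example
    pairs, each preceded by its image placeholder, followed by the target
    question left open for the model to complete."""
    parts = []
    for k, (q, a) in enumerate(examples[:n_shots]):
        parts.append(f"<image_{k}> Q: {q} A: {a}")
    parts.append(f"<image_{n_shots}> Q: {question} A:")
    return " ".join(parts)

prompt = build_prompt([("What animal is this?", "a dog")],
                      "What color is the car?", n_shots=1)
print(prompt)
# <image_0> Q: What animal is this? A: a dog <image_1> Q: What color is the car? A:
```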

Image Captioning
Image captioning tasks require the model to generate accurate descriptions of input images in natural language.We evaluate on two datasets -CoCo Captions (Chen et al., 2015) and NoCaps (Agrawal et al., 2019), measuring performance using the BLEU@4 and CIDEr metrics.NoCaps is designed to evaluate a model's ability to caption images containing uncommon or novel object classes that don't appear in CoCo.
As with SimVLM, prompting with "A picture of" dramatically increases downstream scores, e.g. for MAGMA_long, it increases the CIDEr score on CoCo Captions from 7.5 to 57.1. All scores reported in Table 1 use this prefix. Other prefixes, such as "Caption:", have a similar effect.

Visual Entailment
We test Visual Entailment performance on SNLI-VE (Xie et al., 2018), a task built on top of SNLI (Bowman et al., 2015). SNLI-VE requires the model to reason about the relationship between an image premise, P_image, and a text hypothesis, H_text. Given P_image and H_text as input, the task is to label their relationship as either entailment, neutral or contradiction. We formulate SNLI-VE as a classification task by finetuning the model together with a linear classification head on the last-layer transformer embedding of the last text token.

Ablations
We run two series of ablations: i) one designed to test the impact of the adapter layers and their precise configuration, and ii) another designed to test the impact of the vision encoder choice. We also independently replicate the Frozen model (see Table 1), using the pretraining setup described in their paper (with the exception that we pretrain on CC12M), to use as a baseline. All ablations are trained for a total of 15k steps, or around 3.8 million image-text pairs.

Adapter Types
We run ablations with several different adapter configurations, motivated by He et al. (2021), who show that the precise formulation of the adapter layer can have a large impact on a model's downstream performance, and that different adapter layers can perform better than others depending on the task. Since an exhaustive sweep of the adapter parameter space is very expensive, we settled on seven configurations, including models with no adapters, to get a qualitative picture of the effect on downstream performance. We use the same visual encoder (CLIP 'RN50x16') for all adapter ablations and evaluate the open-ended few-shot scores on the VQA and Image Captioning tasks described in 4.1.1 and 4.1.2, respectively. The results are shown in Table 1. Although no adapter configuration clearly outperforms the rest, we observe three key points: Applying adapters to the attention layer is key. Adapter configurations with no adapters on the attention layer underperform, particularly at few-shot prompting.
More adapter parameters in the feed-forward layer increase performance on knowledge-based tasks. The adapter variant with more parameters allocated to the feed-forward adapter outperforms other variants on OKVQA and NoCaps, tasks requiring outside knowledge and the recognition of uncommon object classes, respectively. This supports preliminary research indicating that the feed-forward blocks are important in storing implicit knowledge in pretrained transformers (Dai et al., 2021).
Balancing attention and feed-forward parameter allocation aids scene understanding. The adapter variant with an equal number of parameters allocated to the attention and feed-forward adapters excels at the GQA benchmark, a QA benchmark built around scene graphs and designed to focus on skills such as spatial reasoning, comparisons, and object and attribute recognition.

Visual Encoders
We run ablations with four different image encoders: NFResnet, CLIP-ViT-B/32, CLIP-RN50x4 and CLIP-RN50x16. All visual encoder ablations are trained using the adapter configuration with sequential adapters on the feed-forward block and a downsample factor of 4. The results are shown in Table 1. Our findings are the following: CLIP-RN50x16, on average, performs best at VQA tasks. However, the difference between RN50x16 and RN50x4 is slight, with the smaller encoder performing better on VQA and OKVQA, while the larger encoder has a much higher GQA accuracy. We hypothesize that the increased resolution of the larger feature grid results in a more detailed scene understanding, while the smaller grid is better at condensing global visual information, which also shows in the Image Captioning scores, where CLIP-RN50x4 excels.
CLIP-ViT has the worst average score across question answering tasks. This reinforces the observation of Shen et al. (2021) that the CLIP-ViT model struggles at tasks which require localization within an image.
Recall that the image prefix length varies between image encoders which may have a confounding effect on the results -further study is needed to disentangle the effects of sequence length and the choice of the vision encoder.

Final Model
Based on our ablation studies, in particular the average VQA scores, we opt to train a final MAGMA model using the CLIP-RN50x16 encoder and sequential adapters with a downsample factor of 8 applied to the feed-forward and attention layers. We train on the dataset detailed in §3.3 and observe that the evaluation loss does not plateau after ∼3M samples as reported in Frozen, so we continue training, resulting in two model variants: MAGMA_base, trained for 15k steps for comparability with Frozen, and MAGMA_long, trained for 7.6M samples.
Due to the inclusion of training splits of tasks like VQA in the pretraining dataset, the performance of MAGMA_base significantly exceeds the downstream performance of the previously trained ablations. The evaluation is conducted in the same way as the zero-shot procedure for the ablations and, to avoid cluttered notation, we refer to it as such, although "zero-shot" usually refers to solving tasks unseen in pretraining. We stress that the pretraining set and the evaluation sets are still disjoint.
While the scores of MAGMA_long already surpass the VQA-finetuned variants reported in Frozen, we find that we can further increase single-task performance by finetuning on the training sets of each benchmark described in §4.1. After finetuning, MAGMA achieves competitive scores across all benchmarks, setting a new state of the art accuracy on OKVQA, as well as attaining strong scores on the NoCaps benchmark - to our knowledge, surpassed only by SimVLM and VinVL (Zhang et al., 2021), see Table 2.
We include several qualitative results, which highlight strengths of the model we feel are not sufficiently reflected by the evaluations in Table 1. Notably, MAGMA appears to be less easily fooled by the adversarial typographic attacks to which CLIP is susceptible (Goh et al., 2021), see Figure 5. Additionally, MAGMA shows impressive OCR capabilities even without supervised finetuning, see Figure 4, which warrants further quantitative evaluation. Interestingly, if a word or phrase is truncated, MAGMA can often impute the missing text. We also include an example of a multi-step factored cognition prompt (Mishra et al., 2021), see Figure 6, where a challenging task is broken down into atomic steps. We suspect that task decomposition may enable MAGMA to perform complex tasks that it would otherwise be unable to solve.

Conclusion
We propose a simple framework for Multimodal Augmentation of Generative Models through Adapter-based Finetuning, demonstrating that it is possible to transform multiple unimodal models into a powerful multimodal VL model while keeping the weights of the language component frozen. Our model, MAGMA, trained using adapter layers and a simple next token prediction objective, performs competitively with state of the art VL models on a wide range of benchmarks, excelling at tasks requiring external knowledge and the recognition of uncommon object classes.
We hope our results will act as a starting point for further research into augmenting pretrained language models with additional modalities.

Limitations
Although the performance of MAGMA is impressive, we note some current limitations of the model and of autoregressive VL models in general. Firstly, as we observed in the Image Captioning tasks, LMs can be sensitive to input - performance is heavily dependent on the prompt format.
Secondly, although the model can perform in-context learning with multiple examples in its context window, it struggles to reason over multiple images, as it was only pretrained on single image-caption pairs. Finally, MAGMA shows similar capabilities to large LMs like GPT-3, about which there are ongoing ethical concerns regarding their reproduction of biases from the training data, as well as concerns relating to how to effectively align their outputs with human goals. As such, further research into the reproduction of visual biases, and into the guiding of model outputs, is needed.

Figure 1 :
Figure 1: An example output produced by MAGMA. For this and all following examples the input text is displayed in black, and the model's response in green.

Figure 2 :
Figure 2: MAGMA's architecture. The layers in red are trained, and the layers in blue remain frozen.

Figure 3 :
Figure 3: An example of a 2-shot prompt for OKVQA.

Figure 4 :
Figure 4: MAGMA's OCR capabilities. Even when text is obscured, MAGMA imputes the missing values.

Figure 5 :
Figure 5: An example of an adversarial typographic attack which MAGMA appears robust to, unlike CLIP.

Figure 6 :
Figure 6: Example of multi-step prompting. Using the output of the model (left) again as the input (right), the generation procedure is broken down into atomic steps.

Table 1 :
Performance evaluation on downstream tasks. Open-ended few-shot evaluation on VQA-val, OKVQA-val, GQA-testdev and VizWiz-val. Captioning evaluation on NoCaps-val and CoCo-val. Models under MAGMA pretrained are trained on the mixed dataset detailed in Section 3.3; all other models are trained on CC12M. Notation for adapter ablations. Type: (s)caled or (p)arallel. λ: 1 or (t)rained. Attn, FF: downsample factor of the bottleneck in the respective position; "-" means not applied. Params: number of trainable parameters relative to the ablation with sequential FF adapters with downsample factor 4.


Table 2 :
MAGMA finetuned performance. B@4: NoCaps-all score. SOTA scores are to the best of our knowledge at the time of writing. If available/applicable, we compare