KM-BART: Knowledge Enhanced Multimodal BART for Visual Commonsense Generation

We present Knowledge Enhanced Multimodal BART (KM-BART), which is a Transformer-based sequence-to-sequence model capable of reasoning about commonsense knowledge from multimodal inputs of images and texts. We adapt the generative BART architecture (Lewis et al., 2020) to a multimodal model with visual and textual inputs. We further develop novel pretraining tasks to improve the model performance on the Visual Commonsense Generation (VCG) task. In particular, our pretraining task of Knowledge-based Commonsense Generation (KCG) boosts model performance on the VCG task by leveraging commonsense knowledge from a large language model pretrained on external commonsense knowledge graphs. To the best of our knowledge, we are the first to propose a dedicated task for improving model performance on the VCG task. Experimental results show that our model reaches state-of-the-art performance on the VCG task (Park et al., 2020) by applying these novel pretraining tasks.


Introduction
Early work on Vision-Language models has largely focused on pure understanding tasks (Tan and Bansal, 2019; Lu et al., 2019). While these models improve performance on understanding tasks such as Visual Question Answering (Antol et al., 2015), they are not capable of multimodal generation tasks such as Image Captioning (You et al., 2016). To address this problem, researchers have proposed various models (Zhou et al., 2020) for generating texts based on visual inputs.
These models are mainly pretrained on general visual and language understanding tasks such as masked language modeling and masked region modeling, which enable the models to build an alignment between visual and language features.* However, feature alignment alone is inadequate for complex multimodal commonsense reasoning, which requires the model to understand the underlying relations and effects between objects.

* The first three authors contributed equally to this work.
Commonsense reasoning was traditionally studied on natural language (Rajani et al., 2019; Trinh and Le, 2018), while recent works have paid attention to commonsense reasoning with joint visual and language inputs. For instance, Zellers et al. (2019) propose the task of Visual Commonsense Reasoning (VCR). However, the task focuses on understanding instead of generation, as it asks the model to answer multiple-choice questions. A newly introduced dataset, Visual Commonsense Generation (VCG) (Park et al., 2020), provides a more challenging task by requiring the model to generate commonsense inferences about what might happen before/after, and the present intents of characters (see Table 2 for an example). In this work, we propose to tackle the task of VCG by leveraging our Knowledge Enhanced Multimodal BART (Lewis et al., 2020), which we call KM-BART. KM-BART is a Transformer-based model consisting of an encoder and a decoder, and is pretrained on carefully designed tasks for VCG. Our contributions in this work are threefold: 1. We extend the BART model to process multimodal data of images and texts, and enable multimodal reasoning by introducing task-relevant tokens.
2. To improve the model performance on Visual Commonsense Generation (VCG), we implicitly incorporate commonsense knowledge from external knowledge graphs to our KM-BART by designing a novel pretraining task, which we call Knowledge-based Commonsense Generation (KCG).
3. Besides KCG, we further equip our KM-BART with standard pretraining tasks including Masked Language Modeling (MLM), Masked Region Modeling (MRM), as well as Attribute Prediction (AP) and Relation Prediction (RP). Experimental results show that all pretraining tasks are effective, and combining them enables our KM-BART to achieve state-of-the-art performance on the VCG task.
Related Work

Vision-Language Models
Vision-Language (VL) tasks such as Visual Question Answering (VQA) (Antol et al., 2015) and Image-Text Matching (Li et al., 2019) require models to process multimodal inputs and comprehend visual and textual information simultaneously. Inspired by successful pretrained language models like BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019), numerous multimodal image-text pretraining and representation learning models (Tan and Bansal, 2019; Lu et al., 2019; Chen et al., 2020) have been proposed. These multimodal pretrained models use Transformers as their backbone and are denoising autoencoders trained to predict the alignment of image-text pairs and the semantics of masked words and image regions. The models mentioned above typically focus more on understanding tasks. To further bridge the gap between visual and textual clues in multimodal data, in addition to cross-modal understanding, a model should also acquire the ability to complete generation tasks, for example, the image-to-text task of Image Captioning (You et al., 2016). However, directly transferring a model pretrained on VL understanding tasks to generation tasks is infeasible, as these models are merely Transformer-based encoders and are thus not suitable for generation tasks. Zhou et al. (2020) ease this problem by using a Transformer-based network as both an encoder and a decoder, making the model capable of generating texts based on visual and textual inputs. Li et al. (2020) propose OSCAR, which improves generation ability by introducing object tags as an additional clue during pretraining. These models achieve state-of-the-art performance in downstream multimodal generation tasks such as Image Captioning (You et al., 2016).

Commonsense Knowledge
Commonsense knowledge refers to the necessary level of practical knowledge and reasoning about everyday situations and events common among most people (Sap et al., 2020). For example, one should know that "water is for drinking" and "sunshine makes people warm". Simple as it looks, enabling artificial intelligence to conduct commonsense reasoning has been difficult for learning-based models (Gunning, 2018). To overcome this problem, researchers have resorted to knowledge graphs, which offer an exact graph-structured representation of knowledge. For example, ConceptNet (Speer et al., 2017) is a knowledge graph with nodes representing general concepts and edges indicating relational knowledge between the concepts. Another commonsense knowledge graph, ATOMIC (Sap et al., 2019), extends nodes to natural language phrases, and edges to relations such as intent, attribution, effect, etc.
Despite improvements in modeling commonsense knowledge, graph-based methods require heavy human engineering, making them challenging to scale robustly. For instance, model performance usually deteriorates dramatically when retrieved contextual knowledge is noisy due to imperfect knowledge matching (Lin et al., 2019). Therefore, we implicitly leverage external knowledge using supervision signals inferred by COMET (Bosselut et al., 2019), a Transformer-based, generative model pretrained on commonsense knowledge graphs including ConceptNet and ATOMIC. Given a natural language phrase and a relation type, COMET generates natural language commonsense descriptions.
In summary, on the one hand, existing cross-modal architectures do not focus on commonsense interpretation: their pretraining tasks are designed for multimodal understanding, making them unsuitable for the downstream VCG task. On the other hand, Transformer-based generative models such as COMET (Bosselut et al., 2019) cannot generate commonsense inferences from cross-modal inputs. Therefore, in this work, we propose KM-BART to conduct the task of Visual Commonsense Generation (VCG). Our KM-BART is pretrained on carefully designed tasks and can thus generate commonsense inferences from cross-modal inputs of images and texts.

Methodology
In this section, we describe our methodology for Visual Commonsense Generation. Section 3.1 gives our model architecture. Section 3.2 introduces our pretraining tasks as well as our self-training based data filtering technique. Figure 1 illustrates the architecture of our KM-BART. The backbone of our model is BART (Lewis et al., 2020), a Transformer-based sequence-to-sequence autoencoder. We modify the original BART to adapt the model to cross-modality inputs of images and texts, and add special tokens to adapt the model to different pretraining/evaluation tasks. In the following subsections, we give the details of our visual feature extractor, the encoder, and the decoder.

Visual Feature Extractor
Following previous work on Vision-Language models, we use a pretrained Mask R-CNN to extract Regions of Interest (RoIs) from the input images. Each RoI is represented by a visual embedding v_i ∈ R^d, where d is the embedding dimension. For each of the RoIs, the Mask R-CNN also outputs the class distribution p(v_i), which is later used for Masked Region Modeling.

Cross-Modal Encoder
Following Lewis et al. (2020), the encoder of our model is based on a multi-layer bidirectional Transformer. We introduce special tokens to adapt it to our pretraining and downstream evaluation tasks. Specifically, each example starts with a special token indicating the task type of the current example. For our pretraining task of Knowledge-Based Commonsense Generation (see Section 3.2.1), we use <before>, <after>, or <intent> as the starting special token. For Attribute Prediction and Relation Prediction (Section 3.2.2), we use <region caption>. Finally, for Masked Language Modeling and Masked Region Modeling, we use <caption>.
Furthermore, to inform the model of the different input modalities, we add three sets of special tokens: For images, we use <img> and </img> to indicate the start and the end of visual embeddings, respectively. For texts, we introduce different special tokens to distinguish between two sets of textual inputs: events and captions. Events are image descriptions which the model uses for reasoning about future/past events or present intents of characters in the commonsense generation task, while captions are for Masked Language Modeling, where linguistic information plays a more important role. Hence, to inform the model of these two types of textual inputs, we use <event> and </event> for events, and <mlm> and </mlm> for captions. In the following sections, we denote the textual inputs of words and special tokens by W = {w_1, ..., w_T}, where T is the length of the textual inputs. For a token w, its embedding is e ∈ R^d, where d is the dimension of the embeddings.
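As an illustration, the input layout described above could be assembled as follows. The special-token names follow the paper; the assembly function and the RoI placeholder tokens are our own and purely hypothetical:

```python
# Hypothetical sketch of the cross-modal encoder input layout.
# Special-token names are from the paper; this helper is illustrative only.

def build_encoder_input(task_token, num_rois, event_words=None, caption_words=None):
    """Assemble the token sequence fed to the cross-modal encoder.

    Visual embeddings are wrapped in <img> ... </img>; event descriptions
    in <event> ... </event>; captions (used for MLM) in <mlm> ... </mlm>.
    """
    tokens = [task_token]
    # Placeholder tokens stand in for the actual RoI visual embeddings.
    tokens += ["<img>"] + [f"<roi_{i}>" for i in range(num_rois)] + ["</img>"]
    if event_words is not None:
        tokens += ["<event>"] + event_words + ["</event>"]
    if caption_words is not None:
        tokens += ["<mlm>"] + caption_words + ["</mlm>"]
    return tokens

# Example: a KCG-style input reasoning about a character's intent.
seq = build_encoder_input("<intent>", num_rois=2, event_words=["a", "man", "waves"])
print(seq)
```

The same helper covers the MLM/MRM layout by passing `caption_words` with the `<caption>` task token instead.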

Decoder
The decoder of our model is also a multi-layer Transformer. Unlike the encoder, which is bidirectional, the decoder is unidirectional as it is supposed to be autoregressive when generating texts. The decoder does not take the visual embeddings as inputs. Instead, we use embeddings of the special token <img feat> to replace the actual visual embeddings. For Masked Region Modeling and Masked Language Modeling, we use <cls> to replace the masked regions or words (see Figure 1). The model should predict the masked words and the class distribution of the masked regions during pretraining.

Pretraining Tasks
To pretrain our model, we use four image-text datasets: the Conceptual Captions Dataset (Sharma et al., 2018), the SBU Dataset (Ordonez et al., 2011), the Microsoft COCO Dataset (Lin et al., 2014), and Visual Genome (Krishna et al., 2017). In the remainder of this section, we use D to denote the individual dataset for each of the pretraining tasks. Statistics of the datasets are given in Table 1. The above datasets consist of examples of parallel images and texts and are widely used in previous work (Tan and Bansal, 2019; Lu et al., 2019; Zhou et al., 2020).

Knowledge-Based Commonsense Generation
The knowledge-based commonsense generation (KCG) task aims to improve the performance of KM-BART on the VCG task. We leverage knowledge induced from COMET (Bosselut et al., 2019), which is a large language model pretrained on external commonsense knowledge graphs. Given a natural language phrase and a relation as inputs, COMET generates natural language phrases as commonsense descriptions. Relations of COMET include xIntent, xWant, xNeed, xReact and xEffect.
We only use COMET to generate new commonsense descriptions on the SBU and COCO datasets due to limits in computational power for pretraining. For each image-text pair, we use COMET to generate commonsense descriptions from the text using all five relations mentioned above. To adapt COMET-generated commonsense knowledge to VCG, we map the COMET relations xIntent and xWant to intent, xNeed to before, and xReact and xEffect to after. In this way, we generate additional commonsense knowledge for the SBU and COCO datasets. The newly generated dataset has more than 3.6 million examples (Table 3). However, the generated commonsense knowledge is not always reasonable, as only the textual information is used while the visual information is completely ignored. To mitigate this problem, we further filter the dataset by employing a self-training based data filtering strategy.

Self-Training Based Data Filtering

Our strategy aims to filter the generated commonsense knowledge dataset so that the examples in the filtered dataset closely resemble the examples in the VCG dataset. To achieve this goal, we first initialize our KM-BART with BART parameters and finetune KM-BART on the VCG dataset for 30 epochs. The finetuned KM-BART already performs well on the VCG dataset, with a CIDEr score of 39.13 (see Table 4).
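The relation mapping described above amounts to a simple lookup. The dictionary and function names below are ours, not part of COMET or the VCG dataset:

```python
# Mapping from COMET relations to VCG task types, as described in the text.
COMET_TO_VCG = {
    "xIntent": "intent",
    "xWant": "intent",
    "xNeed": "before",
    "xReact": "after",
    "xEffect": "after",
}

def to_vcg_task(comet_relation):
    """Return the VCG task type for a COMET relation."""
    return COMET_TO_VCG[comet_relation]

print(to_vcg_task("xNeed"))  # before
```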
We then leverage this finetuned model to evaluate the quality of the commonsense descriptions generated by COMET. We feed the corresponding images, texts, and relations as inputs to the finetuned KM-BART and compute the cross-entropy (CE) loss of the COMET-generated commonsense descriptions. We observe that commonsense descriptions with a lower CE loss make more sense than those with a higher CE loss. Notice that when computing the CE loss of the COMET-generated commonsense descriptions, our KM-BART leverages both the textual inputs and the visual inputs.
We provide examples of our data filtering strategy in the Supplementary Material. We compute the CE loss for all the commonsense descriptions in the VCG dataset and in the new dataset generated by COMET. Figure 2 shows the distributions of the CE loss for the two datasets. We observe that the commonsense descriptions generated by COMET result in higher CE losses, which is expected, as images are completely ignored when using COMET to generate natural language commonsense descriptions. We only keep the examples whose CE loss is below 3.5. Table 3 shows the statistics of the generated dataset before and after data filtering. By filtering, we keep only 1.46 million examples, roughly 40% of the original examples.
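A minimal sketch of the filtering step, assuming a scoring function that returns the CE loss of a generated description under the finetuned model. The scorer below is a stand-in, not the actual KM-BART scorer:

```python
# Self-training based data filtering sketch: keep only generated examples
# whose cross-entropy loss under the finetuned model is below a threshold
# (3.5 in the paper). `score_fn` stands in for the finetuned KM-BART scorer.

def filter_generated(examples, score_fn, threshold=3.5):
    """Return the examples whose CE loss is below the threshold."""
    return [ex for ex in examples if score_fn(ex) < threshold]

# Toy usage with made-up losses in place of the real scorer:
examples = ["desc_a", "desc_b", "desc_c"]
fake_losses = {"desc_a": 2.1, "desc_b": 4.0, "desc_c": 3.4}
kept = filter_generated(examples, fake_losses.get)
print(kept)  # ['desc_a', 'desc_c']
```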
Finally, we leverage the newly generated commonsense knowledge dataset by pretraining KM-BART on it. We expect that this pretraining helps the model reach higher performance on the VCG dataset. Let S = {w_1, ..., w_L} be a commonsense description in the newly generated dataset D. The loss function for KCG is:

\mathcal{L}_{KCG}(\theta) = -\sum_{l=1}^{L} \log P_{\theta}(w_l \mid w_{<l}, V, W)

where L is the length of the generated sequence, l is the index of individual tokens in the target commonsense description S, V and W are the visual inputs and textual inputs, respectively, and θ represents the model parameters to be optimized.

Attribute Prediction and Relation Prediction
The Visual Genome dataset consists of 2.3 million relationships and 2.8 million attributes. To utilize these data, we use attribute prediction (AP) and relation prediction (RP) as pretraining tasks, which enable the model to learn intrinsic properties among different objects in an image.
In the AP task, we feed the output vectors of the decoder for each image feature into an MLP classifier. In the RP task, we concatenate two output vectors of the decoder for each image feature pair and feed it into another MLP classifier. We use the cross-entropy loss for both tasks.
We denote the indices for AP by 1 ≤ j ≤ A and the indices for RP by 1 ≤ k ≤ R, where A is the number of AP examples and R is the number of RP examples. We denote the label for the j-th AP example by L_a(v_j), and the label for the k-th RP example by L_r(v_{k1}, v_{k2}), where v_{k1} and v_{k2} are the two RoIs of the current RP example. The loss function for the AP task is:

\mathcal{L}_{AP}(\theta) = -\sum_{j=1}^{A} \log P_{\theta}(L_a(v_j) \mid V, W)

and the loss function for the RP task is:

\mathcal{L}_{RP}(\theta) = -\sum_{k=1}^{R} \log P_{\theta}(L_r(v_{k1}, v_{k2}) \mid V, W)
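As a toy illustration of these cross-entropy objectives, each term is the negative log-probability of the gold label under the predicted class distribution. The distributions and labels below are made up for the example:

```python
import math

# Toy illustration of the AP/RP cross-entropy losses: the negative
# log-probability of the gold label under the predicted distribution.

def cross_entropy(probs, label):
    """Negative log-probability of the gold label index."""
    return -math.log(probs[label])

# AP: predicted attribute distribution for one RoI, gold attribute index 1.
ap_probs = [0.1, 0.7, 0.2]
ap_loss = cross_entropy(ap_probs, 1)

# RP: predicted relation distribution for a RoI pair, gold relation index 0.
rp_probs = [0.6, 0.4]
rp_loss = cross_entropy(rp_probs, 0)

print(round(ap_loss, 4), round(rp_loss, 4))
```

In the actual model, the distributions come from MLP classifier heads over the decoder outputs, as described above; per-task losses are summed over all AP and RP examples.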

Masked Language Modeling
Following previous work (Devlin et al., 2019; Liu et al., 2019), we randomly mask the input textual tokens with a probability of 15% in the Masked Language Modeling (MLM) task. Within this 15% of the tokens, we replace the masked token with <mask> with a probability of 80%, replace it with a random token with a probability of 10%, and keep it unchanged with a probability of 10%.
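The 15% / 80-10-10 masking scheme can be sketched as follows. The function, toy vocabulary, and sentence are illustrative, not the paper's implementation:

```python
import random

# Illustrative sketch of BERT-style 15% masking with the 80/10/10 split.

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                masked.append("<mask>")          # 80%: replace with <mask>
            elif r < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with random token
            else:
                masked.append(tok)                # 10%: keep unchanged
        else:
            masked.append(tok)
            labels.append(None)  # not masked, no prediction target
    return masked, labels

tokens = "a man waves at the camera".split()
masked, labels = mask_tokens(tokens, vocab=["dog", "car", "tree"])
print(masked)
```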
We denote the mask indices by 1 ≤ m ≤ M, where M is the number of masked tokens. We denote the masked token by w_m and the remaining unmasked tokens by w_{\m}. The loss function for MLM is defined as:

\mathcal{L}_{MLM}(\theta) = -\sum_{m=1}^{M} \log P_{\theta}(w_m \mid w_{\backslash m}, V)

Masked Region Modeling
In the Masked Region Modeling (MRM) task, we sample image regions and mask the corresponding feature vectors with a probability of 15%. The masked vector is replaced by a vector filled with zeros. The model needs to predict the distribution over semantic classes for the masked regions. The objective is to minimize the KL divergence between the output distribution and the distribution predicted by the Mask R-CNN used in visual feature extraction. We denote the mask indices by 1 ≤ n ≤ N, where N is the number of masked regions. We let p(v_n) denote the class distribution of the masked region v_n detected by the Mask R-CNN, and q_θ(v_n) the class distribution output by our model. The loss function for MRM is then:

\mathcal{L}_{MRM}(\theta) = \sum_{n=1}^{N} D_{KL}\big(p(v_n) \,\|\, q_{\theta}(v_n)\big)
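As a toy numeric example of the MRM objective, the KL divergence between a detector distribution p and a model distribution q can be computed as follows; both distributions here are made up:

```python
import math

# Toy computation of the MRM objective: the KL divergence between the
# detector's class distribution p(v_n) and the model's prediction q(v_n).

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # class distribution from the region detector
q = [0.6, 0.3, 0.1]  # class distribution predicted by the model

print(round(kl_divergence(p, q), 4))  # → 0.0268
```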

Combining Losses
To combine all the losses described above, we weight each loss by W_KCG, W_AP, W_RP, W_MLM, W_MRM ∈ R. The weights are chosen to roughly balance every term during the training phase. The final loss is:

\mathcal{L}(\theta) = W_{KCG}\mathcal{L}_{KCG} + W_{AP}\mathcal{L}_{AP} + W_{RP}\mathcal{L}_{RP} + W_{MLM}\mathcal{L}_{MLM} + W_{MRM}\mathcal{L}_{MRM}
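The weighted combination can be sketched as below; the weights and loss values are placeholders chosen only to illustrate the mechanism, not the values used in the experiments:

```python
# Sketch of the weighted sum of pretraining losses.

def combined_loss(losses, weights):
    """Weighted sum of the per-task losses."""
    return sum(weights[name] * value for name, value in losses.items())

# Placeholder values purely for illustration:
losses = {"KCG": 2.0, "AP": 1.5, "RP": 1.2, "MLM": 1.8, "MRM": 0.4}
weights = {"KCG": 1.0, "AP": 0.5, "RP": 0.5, "MLM": 1.0, "MRM": 1.0}
print(round(combined_loss(losses, weights), 2))  # 5.55
```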

Experiments
We describe our experiments in this section. Section 4.1 describes the experimental settings of different pretraining and initialization strategies. Section 4.2 gives the evaluation task and metrics. We show our results in Section 4.3. In Section 4.4, we give example inferences generated by our model. Section 4.5 presents the human evaluation results.

Settings
In our experiments, following the base model from Lewis et al. (2020), we fix the model architecture to a 6-layer encoder and a 6-layer decoder. To understand how each pretraining task helps model performance on the downstream task of VCG, we ablate on the pretraining tasks. We use the following experimental settings: (1) without any pretraining; (2) only with Knowledge-based Commonsense Generation; (3) only with Attribute Prediction and Relation Prediction; (4) only with Masked Language Modeling and Masked Region Modeling; (5) with all the pretraining tasks combined. For setting (2), only with Knowledge-based Commonsense Generation, we further compare the model performance before and after data filtering (see Section 3.2.1). For each of the above settings, we initialize the model from random or from BART weights, respectively. In addition, we are most interested in the model performance under two settings (see the second column of Table 4): (1) only using images as inputs; (2) using both images and event descriptions as inputs. Note that when only using images as inputs for evaluation, we also do not use textual inputs during pretraining/finetuning.

Evaluation Task and Metrics
We evaluate our model on the recently proposed Visual Commonsense Generation (VCG) Dataset (Park et al., 2020). Given an image and a description of the event in the image, the task aims to predict events which might happen before/after, and the present intents of the characters in the given image. The dataset consists of 1174K training examples and 146K validation examples. Some examples in the dataset share the same images or events, but with different inferences for events before/after or intents at present. Table 2 gives an example of the dataset. We report our model performance on the validation set as the test set is not available yet.
Besides event descriptions, the VCG dataset also provides Place and Person information for each image. Note that although Park et al. (2020) also leverage the Place and Person information for training and evaluation, we argue that such information is not generally available in normal settings, where only images and event descriptions are given. Hence, we do not use the Place and Person information in our KM-BART. As an additional reference, we nevertheless show in Table 5 the best-performing models from Park et al. (2020), which also use the Place and Person information.
We use three automatic evaluation metrics: BLEU-2 (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014), and CIDEr (Vedantam et al., 2015). Following Park et al. (2020), we also report Unique, the fraction of generated inference sentences that are unique among all generated sentences, and Novel, the fraction of generated sentences that do not appear in the training data.
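One plausible implementation of the Unique and Novel ratios, under the reading that a generated sentence counts as unique if it appears exactly once among all generated sentences. The functions and toy data are ours:

```python
# Illustrative computation of the Unique and Novel ratios described above.

def unique_ratio(generated):
    """Fraction of generated sentences that appear exactly once."""
    return sum(1 for s in generated if generated.count(s) == 1) / len(generated)

def novel_ratio(generated, training_set):
    """Fraction of generated sentences not seen in the training data."""
    return sum(1 for s in generated if s not in training_set) / len(generated)

gen = ["a", "b", "b", "c"]
train = {"a"}
print(unique_ratio(gen), novel_ratio(gen, train))  # 0.5 0.75
```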

Results
We first ablate on different pretraining tasks to understand the effect of each task. We then combine all the pretraining tasks to train our full model. As a last step, we pick the best-performing models to compare against the previous state-of-the-art system (Park et al., 2020).

Table 5: Results on VCG validation set with nucleus sampling. Following Park et al. (2020), we use nucleus sampling with p = 0.9 to generate five inference sentences for each example during evaluation. *: we directly use evaluations from Park et al. (2020). Bold: best performance. Italic: second best performance. Modalities: information used during training. Event: whether or not event descriptions are used during evaluation.

Table 4 shows the effect of each pretraining task on our KM-BART on the VCG dataset. We can see that all our pretraining tasks help improve model performance. Most importantly, we observe that although filtering the commonsense generation pretraining dataset reduces its size by more than 60%, pretraining with KCG still reaches comparable or better performance than pretraining with KCG (before filtering). This demonstrates that our self-training based filtering technique is helpful, as it helps the model reach similar or even better performance with less training data. The advantage is most evident when we initialize from BART parameters and use both images and event descriptions as inputs. Under this setting, pretraining with KCG outperforms pretraining with KCG (before filtering) in terms of all the evaluation metrics.
For using both images and event descriptions as inputs, the model performs better when initialized from pretrained BART parameters, as pretrained BART can better leverage the information in the event descriptions. Hence, to obtain our full KM-BART model for using images and events as inputs, we initialize from BART parameters. Experimental results show that our full model† reaches high performance on BLEU-2, METEOR, and CIDEr, and that the full model† generates the most unique and novel inferences.
For using only images as inputs, models initialized from random parameters outperform those initialized from BART parameters. We argue that initializing from BART parameters results in optimization disadvantages, as the model has to switch from pure textual inputs to pure visual inputs. This observation becomes evident as the model performs the worst when no pretraining is used, which indicates that the model has to rely entirely on finetuning on the VCG dataset to adapt to visual inputs. Therefore, for using only images as inputs, we obtain our full KM-BART model by initializing from random parameters. Our full model§ reaches the best performance on BLEU-2, METEOR, and CIDEr, and is the second best in terms of Unique.
In Table 5, we compare our full model to the previous state-of-the-art (Park et al., 2020). We observe that although our full model† taking as inputs images and event descriptions does not use the Place and Person information, it still outperforms the previous state-of-the-art (Park et al. (2020)c). For using only images as inputs, our model§ also performs better than previous results (Park et al. (2020)b). Furthermore, our model§ reaches comparable performance to Park et al. (2020)a in terms of BLEU-2, METEOR, and CIDEr, with much higher performance on Unique and Novel, even though our model§ uses much less information during training compared to Park et al. (2020)a.

Case Study
In Table 2, we show example inferences and compare our model predictions to the ground truths. The sentences generated by the model without event descriptions as inputs can already capture the most important commonsense information. We also observe that adding event descriptions to the inputs helps the model generate more details. We give more examples of our model outputs in the Appendix.

Table 6: Human evaluation results. We compare the inferences generated by our best models under the settings with event and without event. † and § indicate the corresponding models in Table 4. We use Park et al. (2020)c* for both with event and without event, as Park et al. (2020) only release the weights of this model.

Human Evaluation
We conduct a human evaluation to further understand how humans perceive the inferences generated by our KM-BART. We employ a comparison approach for a better assessment between our KM-BART and the model from Park et al. (2020). Specifically, we randomly sample 30 examples from the VCG validation set. For each example, we use our KM-BART or the baseline model to generate 5 sets of inferences, each of which consists of the task types before, after, and intent. We use two settings for our human evaluation: (1) with event: event descriptions are given as input at inference time; (2) without event: event descriptions are not given at inference time. Under each setting, we compare our KM-BART model with the model from Park et al. (2020). We use the same 30 examples for each model under the two settings. For each example in a task type (before, after, or intent), we generate 5 inferences for one model of each setting. In total, we generate 450 inferences for each model of each setting during the human evaluation.
For the same example, we use our KM-BART and the model from Park et al. (2020) to generate an inference under one of the three task types; the workers then choose the more reasonable of the two generated inferences. We hire three workers from Amazon Mechanical Turk to evaluate each inference and take the majority vote of the three workers as the final evaluation for an inference. Among all the inferences, we use the percentage of cases in which one model is judged better than the other as that model's score. For example, in Table 6, the score of our model (Ours§) is 61.3 for the task type before when event descriptions are missing, indicating that our model is better than the baseline model for the task type before in 61.3% of the cases.

From Table 6, we can observe that our model outperforms Park et al. (2020) under both settings. Specifically, when event descriptions are not given, our model is better than Park et al. (2020) in 66.7% of all the cases. Furthermore, our model has a lead of at least 22.6% over Park et al. (2020) in each individual task. For example, our model generates better inferences in 68.7% of the cases in the task type after, while the model from Park et al. (2020) is better than our model in only 31.3% of the cases. We obtain similar results for the task types before and intent.
When event descriptions are given, our model is still better than Park et al. (2020) in 55.1% of all the cases. For each individual task, the advantage of our model is smaller when event descriptions are given than when event descriptions are not given, showing that our model can better capture information from the images.

Conclusion and Future Work
In this paper, we propose Knowledge Enhanced Multimodal BART (KM-BART), a Transformer-based model capable of reasoning about and generating commonsense descriptions from cross-modality inputs of images and texts. We propose the pretraining task of Knowledge-Based Commonsense Generation, which improves the reasoning ability of KM-BART by leveraging a large language model pretrained on external commonsense knowledge graphs. We use a self-training technique to filter the automatically generated commonsense descriptions. Experimental results on the VCG task show that our KM-BART pretrained on the proposed tasks reaches state-of-the-art performance. A further human evaluation demonstrates that our KM-BART generates commonsense inferences of high quality.
For future work, we plan to further expand our pretraining dataset for Knowledge-Based Commonsense Generation by including the Conceptual Captions Dataset (Sharma et al., 2018). Furthermore, while we argue that Place and Person information is not generally available in practical scenarios, we still plan to add Place and Person information to our model in the future.