On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization

Combining the visual modality with pretrained language models has been surprisingly effective for simple descriptive tasks such as image captioning. More general text generation, however, remains elusive. We take a step back and ask: How do these models work for more complex generative tasks, i.e., conditioning on both text and images? Are multimodal models simply visually adapted language models, or do they reason jointly over modalities? We investigate these questions in the context of self-rationalization (jointly generating task labels/answers and free-text explanations) on three tasks: (i) visual question answering in VQA-X, (ii) visual commonsense reasoning in VCR, and (iii) visual-textual entailment in e-SNLI-VE. We show that recent unimodal advances, CLIP image representations and scaling of language models, do not consistently improve self-rationalization in multimodal tasks. We find that no single model type works universally best across tasks, datasets, and finetuning data sizes. Our findings motivate the need for novel general backbone approaches that move text generation from images and text beyond image captioning.


Introduction
The pretrain-finetune paradigm has changed the field of NLP. Inspired by its success, there has been an explosion of interest in multimodal pretraining (Su et al., 2020; Lu et al., 2019; Chen et al., 2019; Li et al., 2020a; Tan and Bansal, 2019; Li et al., 2020b; Gui et al., 2022a). To enable text generation from images, captioning is often included as one of the pretraining tasks (Gupta et al., 2022). Captioning is also the only generative task used to evaluate and compare joint models, for which only minor improvements are reported relative to classification tasks (Cho et al., 2021). Moreover, captioning is conditioned only on a single image. This leads us to ask: Do recent advances transfer to more complex generative tasks? Can generation condition on both images and text? Another line of work skips joint pretraining and directly modifies and finetunes a pretrained language model apt for generation (e.g., GPT-2; Radford et al., 2019) on multimodal datasets (Sollami and Jain, 2021; Eichenberg et al., 2021; Gui et al., 2022b). This approach has distinct benefits and downsides compared to models based on joint pretraining (see Table 1), but these two families of models are rarely compared. This leads us to another question: Given a new generative task, which approaches should be used or combined?

* Work undertaken while Shruti Palaskar and Ana Marasović were at the Allen Institute for AI. ♣ Shruti Palaskar is currently at Apple.
We study these questions through the lens of a newly emerging, important, but challenging task: self-rationalization (Wiegreffe et al., 2021), i.e., jointly generating both the task label/answer and a free-text explanation for the prediction. Standard tasks for studying multimodal self-rationalization present different levels of difficulty. Explaining VQA is similar to captioning since the corresponding explanations describe visible information in the image that is relevant to the context (Fig. 1, left). On the other hand, VCR instances require higher-order reasoning about unstated information such as commonsense (Fig. 1, right). Since VCR answers are full sentences, self-rationalization here consists of two generative sub-tasks.
We evaluate the following models: (i) a joint vision-language model, VLP, (ii) a pretrained language model, T5 (Raffel et al., 2020), that we visually adapt only through finetuning, and (iii) VL-T5/VL-BART (Cho et al., 2021), a combination of the previous two approaches. Namely, VL-T5/VL-BART are developed from T5 and BART (Lewis et al., 2020) by doing a second round of pretraining, using multimodal datasets and objectives. We finetune all models for self-rationalization in two data settings: (i) using the entire finetuning datasets (high-resource setting), and (ii) using 30% of the data (low-resource setting).
We first present an analysis of the factors influencing the performance of the visually adapted T5: the choice of image representation (§4.1) and T5's model size (§4.2). We demonstrate that recent advances in representing images, namely CLIP features, can be easily leveraged to get more accurate visually adapted T5 models. However, these improvements are not realized for the more complex sub-task of explanation generation. Moreover, unlike for text generation conditioned only on text (including self-rationalization of tasks with textual inputs; Marasović et al., 2022), we do not see clear performance improvements from scaling visually adapted PLMs. Finally, the main comparison of the three model types described above (§4.3) shows that no model type works universally best across tasks and data regimes. These results demonstrate that outside of image captioning, there is no clear choice of model backbone or training regime for image-conditioned text generation. We aim to motivate research on multimodal model comparisons across generative tasks and experimental setups, to help realize the benefits of different model families (Table 1) with a single approach.

Text Generation from Images: Models
Vision-and-language (VL) learning currently comprises two families of models: (i) joint VL models that are pretrained from scratch using data with both modalities (§2.1), and (ii) vision-adapted language models, i.e., pretrained language models adapted to the visual modality through finetuning on end-task multimodal datasets (§2.2). Some models combine these two approaches to some extent, and so could offer the best of both worlds (§2.3).
In Table 1, we overview reasons why one model family might be preferred over the other. In Tables 2-3, we outline model specifications and list image sources. For generative tasks conditioned on images, including self-rationalization of VL tasks, the choice of the best base model family is not obvious. The aim of this paper is to find out whether such a choice exists.

VLP: Joint Vision-and-Language Model
We use VLP to analyze the benefits of joint VL pretraining from scratch relative to other approaches.
VLP is a shared encoder-decoder transformer (Vaswani et al., 2017) of size similar to BERT-Base (110M parameters; Devlin et al., 2019). It is pretrained with objectives similar to both masked and standard language modeling. Thus, it is suitable for discriminative as well as generative tasks. A given input image is represented with vector representations of a fixed number of regions obtained with an off-the-shelf object detector. During finetuning, the same object detector representations should be used. The Conceptual Captions dataset (Sharma et al., 2018), containing about 3M web-accessible images and associated captions, is used for pretraining.

                                                          Joint  VA-PLM  Combined
1 Designed for some text generation task
  (e.g., image captioning, language modeling)               ✓     Some     Some
2 Offered in larger sizes (related to better
  text generation performance)                                     ✓
3 Large textual pretraining data (related to
  capturing world/commonsense knowledge)                           ✓        ✓
4 Easy plug-and-playing with the latest
  pretrained LMs and image representations                         ✓
5 Tight coupling between modalities                         ✓               ✓
6 Expected to be robust when multimodal
  training data is limited                                  ✓               ✓

Table 1: A comparison between: training vision and language (VL) jointly from scratch, adapting pretrained language models (PLM) to visual features, and models that somewhat combine these two approaches, w.r.t. desired model properties for self-rationalization. Text generation typically improves with model size (Brown et al., 2020), incl. self-rationalization (Marasović et al., 2022). Due to huge pretraining corpora, PLMs have been shown to capture some world (Petroni et al., 2019) and commonsense knowledge (Davison et al., 2019), which is beneficial for self-rationalization as the task often requires inferring relevant information from the inputs (see Fig. 1).

VA-T5: Vision-Adapted Pretrained LM
Vision-adapted training involves starting with a pretrained language model (PLM) and adapting it to VL tasks as a finetuning step. We start with T5 (Raffel et al., 2020)-a PLM that is commonly used for self-rationalization (Narang et al., 2020;Hase et al., 2020;Wiegreffe et al., 2021;Marasović et al., 2022)-and finetune it for self-rationalization of VL tasks. Specifically, we concatenate image representations with representations of textual inputs, feed the result to the subsequent PLM's layers, and train using language modeling loss on generated answer and explanation tokens.
A question that emerges is: what kind of image representations should be used? While vector representations of image regions extracted with an object detector are the most frequent choice, advantages of representations from the CLIP model (Radford et al., 2021) have recently been demonstrated for various applications (Shen et al., 2022). Moreover, we wonder whether using automatic image captions is the way to visually adapt a PLM, given that the input will then be completely textual, i.e., in the modality a PLM has seen before. Thus, in §4.1, we compare three different features: (i) auto-generated captions from an off-the-shelf image captioning model (the VL-T5 model; Cho et al., 2021), (ii) object features from a pretrained Faster R-CNN model (Ren et al., 2015), and (iii) CLIP features (obtained from the last layer of ViT-B/32). Besides exploring different image representations, visually adapting a PLM allows us to study model scaling. In §4.2, we study the following sizes of T5: Base (220M parameters), Large (770M), and 3B. We refer to the visually adapted T5 as VA-T5.
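The input-level fusion described above can be sketched minimally as follows. This is an illustrative sketch, not the paper's exact implementation: the dimensions, the learned linear projection, and the function name are our assumptions.

```python
import numpy as np

def visually_adapt_input(token_embeds, image_feats, proj_W, proj_b):
    """Prepend projected image features to the text embeddings.

    token_embeds: (seq_len, d_model)  -- PLM word embeddings of the text input
    image_feats:  (n_regions, d_img)  -- e.g., CLIP or Faster R-CNN features
    proj_W, proj_b: a learned projection mapping d_img -> d_model

    Returns (n_regions + seq_len, d_model), fed to the PLM encoder as usual;
    the language modeling loss is then computed on answer/explanation tokens.
    """
    projected = image_feats @ proj_W + proj_b  # (n_regions, d_model)
    return np.concatenate([projected, token_embeds], axis=0)

# Toy example: 36 detected regions (2048-d) fused with a 20-token prompt (768-d).
rng = np.random.default_rng(0)
d_img, d_model = 2048, 768
feats = rng.standard_normal((36, d_img))
tokens = rng.standard_normal((20, d_model))
W = rng.standard_normal((d_img, d_model)) * 0.02
b = np.zeros(d_model)
fused = visually_adapt_input(tokens, feats, W, b)
print(fused.shape)  # (56, 768)
```

In a real VA-T5 model, `tokens` would come from T5's embedding layer and the fused sequence would replace the encoder's usual input embeddings; only the projection is new trainable machinery.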

VL-T5 / VL-BART: Combined Models
Another approach is to start with a PLM, do a second round of pretraining to learn joint VL representations, and finally finetune the model on the end-task. This approach can be seen both as a joint model and as a visually adapted PLM, and thus may offer the benefits of both model families.
To compare this approach with the others, we use VL-T5 and VL-BART (Cho et al., 2021). VL-BART, a multimodal extension of BART-Base (139M parameters; Lewis et al., 2020), also follows an encoder-decoder transformer but does not share the parameters between the encoder and decoder as is done in VLP. To represent images, it uses the Faster R-CNN model for object detection (Ren et al., 2015). Specifically, vector representations of a fixed number of image regions are concatenated with text embeddings and fed into VL-BART, which is then pretrained using masked language modeling objectives in addition to new objectives such as visual question answering, image-text matching, visual grounding, and grounded captioning. VL-T5 is similar in spirit to VL-BART, but is initialized with a T5-Base model (220M parameters; Raffel et al., 2020). T5 is trained for various downstream tasks jointly, whereas BART exploits a task-specific encoder-decoder setup for sequence generation tasks. Both VL-BART and VL-T5 are pretrained with MS COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2017), VQA v2 (Goyal et al., 2017), GQA (Hudson and Manning, 2019), and Visual7W (Zhu et al., 2016), leading to a total of 9.18M image-text pairs over 180K unique images.

Table 2: Model specifications. PT stands for "pretraining", MLM for "masked language modeling", "Img. Feat." for "Image Features". Sources: BERT (Devlin et al., 2019), UniLM (Dong et al., 2019), BookCorpus (Zhu et al., 2015), T5 (Raffel et al., 2020), C4 (Raffel et al., 2020).

Experimental Setup
In this section, we describe the tasks, datasets, and evaluation setup for self-rationalization of vision-and-language tasks introduced in prior work (Marasović et al., 2020; Kayser et al., 2021).

Tasks and Datasets
The dataset statistics are given in Table 4, where the average answer and explanation lengths hint at differences in the complexity of each task. The three datasets represent different levels of required reasoning (see examples in Figure 1), e.g., whether the images depict simpler scenarios (Flickr30K) or complex movie scenes (VCR). VQA-X (Park et al., 2018) is an extension of the widely used Visual Question Answering v1 (Antol et al., 2015) and v2 (Goyal et al., 2017) datasets with corresponding free-text explanations. The images are originally sourced from the MSCOCO dataset (Lin et al., 2014), and the answers are collected for open-ended questions about these images that require vision, language, and commonsense knowledge to answer.

Table 3: Image sources for each dataset.

E-SNLI-VE (Kayser et al., 2021) combines (i) SNLI-VE, a visual-textual entailment dataset built over Flickr30K images, and (ii) E-SNLI (Camburu et al., 2018), a dataset of crowdsourced free-text explanations for SNLI. VCR (Zellers et al., 2019) is a carefully crowdsourced dataset of answers and explanations for visual scenes extracted from Hollywood movies. Thus, the visual context in this data is more complex than MSCOCO or Flickr30K images, leading to more complex answers and explanations. Zellers et al. instructed crowdworkers to first annotate answers for a given question-image pair, and then showed the annotated answer along with the question-image pair to a different set of annotators to get the corresponding explanation. This dataset was originally introduced in a classification setting: given a question about an image, pick the correct answer from four choices, and then pick the correct explanation, again from four choices. Dua et al. (2021) propose instead to generate both the answer and explanation. This is more realistic than the multiple-choice setting, which requires a user to provide answer choices. In this paper, we also generate VCR answers and explanations.

Evaluation Metrics
Self-rationalization requires evaluating two subtasks: correctness of predicted answers/labels, and quality of generated explanations. For the former, we use accuracy for E-SNLI-VE and VQA-X. Evaluating VCR answers is more complicated as they are full sentences. Following Dua et al. (2021), given a generated answer, we normalize text (remove articles, punctuation, lowercase), and count the number of overlapping words with the four available answer choices in the VCR dataset.
We select the answer candidate with the highest overlap as the predicted answer. Proxy accuracy is then computed between the correct answer candidate and the predicted answer candidate. Dua et al. (2021) do not report the correlation between proxy accuracy and human judgments of answer plausibility. To this end, for 600 VCR instances, we ask 5 crowdworkers to respond to "Given the image and the question, is this answer likely?" with yes, weak yes, weak no, or no, and map the answers to 1, 2/3, 1/3, and 0, respectively. Answer plausibility is the average of the scores of the 5 annotators. In §4, we report the average answer plausibility across the 600 instances, as well as proxy accuracy. Spearman's correlation coefficient between proxy accuracy and answer plausibility is 0.56 (p < 0.028), indicating a moderate correlation between them.

Automatic metrics are unreliable, so human evaluation has been used to evaluate free-text explanation generation (Camburu et al., 2018; Kayser et al., 2021; Clinciu et al., 2021). We ask 3 annotators whether an explanation justifies an answer/label given an image, a question/hypothesis, and a generated answer/label. Annotators pick one of four options (yes, weak yes, weak no, no), which are assigned numerical values (1, 2/3, 1/3, 0). We average the scores of the 3 annotators to get the plausibility of an individual explanation, and report the average explanation plausibility over a sample of 300 instances. Following Kayser et al. (2021), we select the first 300 instances for which the answer/label is correctly generated. For E-SNLI-VE, we select an equal number of examples for each label to produce a balanced evaluation set. Human evaluation was conducted on Amazon Mechanical Turk.

Kayser et al. (2021) report that all automated metrics are weakly correlated with explanation plausibility (per humans), but that BERTscore is the most correlated. Therefore, we report BERTscores for completeness and reproducibility. Following Kayser et al. (2021), we set the BERTscore of incorrectly predicted instances to 0.
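The answer-matching and plausibility-scoring procedures above can be sketched as follows. This is our reading of the protocol, with hypothetical function names and example data; the exact normalization follows Dua et al. (2021) only in outline.

```python
import string

ARTICLES = {"a", "an", "the"}

def normalize(text):
    """Lowercase, strip punctuation, and drop articles, as described in §3.2."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w not in ARTICLES]

def pick_answer(generated, choices):
    """Return the index of the VCR answer choice with the highest word overlap."""
    gen = set(normalize(generated))
    overlaps = [len(gen & set(normalize(c))) for c in choices]
    return max(range(len(choices)), key=lambda i: overlaps[i])

# Likert responses are mapped onto [0, 1] and averaged over annotators.
LIKERT = {"yes": 1.0, "weak yes": 2 / 3, "weak no": 1 / 3, "no": 0.0}

def plausibility(responses):
    return sum(LIKERT[r] for r in responses) / len(responses)

# Toy illustration with made-up answer choices and annotator responses.
choices = ["He is reading a book.", "She waves at the camera.",
           "He is asleep.", "They are eating dinner."]
print(pick_answer("The man is reading the book", choices))  # 0
print(round(plausibility(["yes", "weak yes", "no", "yes", "weak no"]), 2))  # 0.6
```

Proxy accuracy is then ordinary accuracy computed over the indices returned by `pick_answer` against the gold choice indices; answer plausibility averages the 5-annotator scores, and explanation plausibility the 3-annotator scores, per instance.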

Results
To study whether there is a base model family that is more suitable for text generation conditioned on images and text, we compare: (i) a joint vision-and-language model, VLP (§2.1), (ii) a visually adapted PLM, VA-T5 (§2.2), and (iii) VL-T5 / VL-BART, a combination of the previous two model families (§2.3), for self-rationalization of the three tasks in Figure 1 and Table 4. The benefits and downsides of (i) and (ii) are outlined in Table 1. Before presenting the outcomes of this comparison in §4.3, we study the impact of the choice of image features and model size on VA-T5's performance.

VA-T5: Analysis of Image Features
Visually adapting PLMs allows us to combine different image features with a PLM's text representations. We analyze VA-T5-Base (finetuned with the full training data) with different image features in the input: auto-generated captions, object features, and CLIP features (see §2.2 for more information). We also report a control setting where no image features are used (None). In Table 5a, we report answer (proxy) accuracy and plausibility (for VCR), and in Tables 5b and 5c, explanation BERTscore and plausibility. Metrics are described in §3.2 and hyperparameters are reported in the Appendix.

Results
We observe that CLIP features give the best accuracy scores for all three datasets (Table 5a). This result demonstrates a benefit of visual adaptation: advances in image representations, such as CLIP, can be effortlessly used, unlike with joint models, which need to be re-trained from scratch with the new representations. In §2.2, we hypothesize that captioning could be a straightforward way to bridge the two modalities and get the most out of a PLM that is already well-positioned to solve the end-task in one modality (text). However, with the exception of VQA-X, captions give the worst accuracy scores relative to object and CLIP features.
We turn to the evaluation of the plausibility of generated explanations, which paints a different picture (Table 5c). We observe that CLIP and object features perform similarly for VCR and VQA-X; object features are even slightly better for VQA-X. In other words, the advances from CLIP features diminish for the more complex task of explanation generation. Captions work best for generating E-SNLI-VE explanations with VA-T5-Base, but not for the other two datasets. However, E-SNLI-VE is an outlier in another way: it is the only task for which having no image features is better for explanation generation than having CLIP features, and only slightly worse (0.3 points) than having object features. Notably, CLIP/object features require combining vectors from different models, while captions are represented with the same pretrained word embeddings as the rest of the input. We thus explore whether the way layer normalization is applied to the concatenated vectors is crucial, but find that it is not (see Appendix). We leave further analysis of why visual adaptation gives only minor improvements for generating E-SNLI-VE explanations with VA-T5, relative to the other datasets, to future work.
BERTscore results (Table 5b) are mixed: CLIP features are the best for VCR and the worst for VQA-X. Moreover, the differences between BERTscore values obtained with different features are very small (0.0-0.4 points), which makes these results hard to interpret.

VA-T5: Analysis of Model Size
Another advantage of visually adapting PLMs is that we can use larger model sizes, since PLMs are scaled more frequently than joint models. The benefits of scaling the model and pretraining data size are outlined in Table 1. We explore three model sizes for VA-T5 (§2.2): Base (220M), Large (770M), and 3B. We use CLIP features to visually adapt T5 for these experiments, since they give more accurate VA-T5 models while generating explanations that are similarly plausible to those generated by T5 adapted with object features (see §4.1). We use the full training sets to finetune VA-T5 models in this section.

Table 7: Comparison of a joint VL model (VLP), a visually adapted pretrained LM (VA-T5), and combined models (VL-BART, VL-T5) on three datasets: VCR, E-SNLI-VE, and VQA-X. We report (proxy) answer accuracy and plausibility (for VCR), and explanation BERTscore and plausibility. See §2 for more information on models and §3 for tasks, datasets, and evaluation metrics.

Results
Scaling the model size from 220M (Base) to 770M (Large) parameters gives more accurate models for VCR and VQA-X, but further scaling to 3B parameters degrades performance. This is in contrast to self-rationalization with text-only inputs, where performance monotonically increases with T5's model size (Marasović et al., 2022). E-SNLI-VE is an exception, with no clear pattern between model size and accuracy. Moreover, explanation plausibility decreases with model size (Base > Large > 3B) for E-SNLI-VE and VQA-X. This is also in contrast to observations in text-only self-rationalization. The exact opposite is true for the plausibility of generated VCR explanations, which increases with model size (3B > Large > Base). Notably, in Table 1, we report that larger model and data sizes are correlated with capturing more world and commonsense knowledge, and generating VCR explanations requires more inference about information that is unstated in the input, relative to generating E-SNLI-VE or VQA-X explanations. This might explain why scaling is beneficial for VCR but not for the other datasets.
Unlike accuracy and plausibility, for which model scaling is helpful at least to some extent, BERTscore values decrease monotonically as the model size grows. Despite reservations about this evaluation metric given its weak correlation with human judgments of explanation plausibility, BERTscore values increase monotonically for self-rationalization with textual inputs, as expected (Marasović et al., 2022). Thus, we see this result as further evidence of the difficulty of visually adapting larger models rather than a limitation of BERTscore as an evaluation metric.
A better understanding of the bottleneck in visually adapting larger PLMs is needed. We might need other ways to visually adapt PLMs besides the simple input-level changes explored so far.

Joint Models vs. Visually Adapted PLMs
We turn to our main comparison between a joint model (VLP), a visually adapted PLM (VA-T5-CLIP), and combined models (VL-BART, VL-T5). Given that joint models might be advantageous when finetuning data is limited, we compare the models when finetuned with: (i) the entire training sets (high-resource data setting), and (ii) 30% of the training data (low-resource data setting).
Results

In a high-resource setting (Table 7a), the best answer accuracy/plausibility is achieved by a different model for each dataset. To illustrate, VA-T5 (a visually adapted PLM) obtains the best VCR answer proxy accuracy, VLP (a joint model) the best VCR answer plausibility, VL-T5 (a combined model) the best E-SNLI-VE accuracy, and VL-BART (another combined model) the best VQA-X accuracy.
Explanation plausibility results are slightly more consistent (Table 7a). Namely, VL-BART generates the most plausible explanations for E-SNLI-VE and VQA-X, and is only behind VLP for VCR. The best explanation BERTscore is also achieved by VL-BART for all tasks. However, the relative order of model types (joint, visually adapted, combined) across tasks is still mixed. Specifically, for VCR: joint > combined (both) > adapted (all); for E-SNLI-VE: combined (both) > adapted (all) > joint; for VQA-X: combined > adapted > joint.
It is not necessarily concerning that the results are mixed, given the unique benefits and downsides of these models (see Table 1) that could be relevant for one task and not another. However, the observed cross-task differences are not intuitive. For example, visually adapted T5 and the combined models give worse VCR explanation plausibility than the joint model, VLP. Since generating VCR explanations requires the most commonsense and world knowledge of the three tasks, it is reasonable to expect this to be a scenario where extensive pretraining on text would be beneficial, but it turns out not to be.
We now turn to results in the low-resource data setting (see Table 7b). We hypothesized that joint models might work better when finetuning data is limited, since there might not be enough images to appropriately visually adapt PLMs, which might then still behave like (unimodal) language models. However, VA-T5 does not always underperform relative to VLP, VL-BART, and VL-T5 in the low-resource setting. Specifically, for explanation generation, VA-T5 is slightly better than VL-BART and VL-T5 for VCR, comparable to VLP for E-SNLI-VE, and better than all three models for VQA-X. It is also better than VLP at generating VCR answers. These results show that the size of the finetuning data is not as detrimental for visual adaptation, relative to joint models, as we speculated.
As in the high-resource setting, we observe that no model (type) works universally best. The ordering of models by performance sometimes stays consistent compared to the high-resource setting, e.g., for E-SNLI-VE accuracy, and sometimes changes notably, e.g., VLP gives the best VCR answer and explanation plausibility in the high-resource setting, but the worst when data is limited. These results highlight the necessity of comparing models not only across a variety of tasks/datasets, but also when finetuned with different amounts of data.
The differences in model performance are smaller in the high-resource setting than in the low-resource setting, where the gap is sometimes huge. For example, VL-BART achieves a VCR answer plausibility of 42.8 in the low-resource setting, while VLP achieves only 21.1. Another example is VQA-X answer accuracy, for which VL-BART achieves 85.6 and VLP 71.8. Unlike for VCR, this can be explained by the fact that VQA is one of the tasks used to pretrain VL-BART (and VL-T5). Such huge differences between models are not observed for explanation generation: even though a model may be much more accurate, the explanations for its correct answers are not that much more plausible compared to the explanations of other models.

Conclusions
We extensively analyze different multimodal models that have unique benefits and downsides for text generation conditioned on images and text beyond image captioning. We focus on self-rationalization (jointly generating labels/answers and free-text explanations), and show that there is no single approach that works best across instances of this complex domain. A key question moving forward is how best to leverage unimodal advances.
In the meantime, our findings can be used as intermediate guidelines for which model to choose:
• Unlike for most text-only tasks, larger visually adapted language models do not give better results. Our results suggest starting with T5-Large (770M parameters).
• Although not always the best, CLIP features are a reasonable choice for visual adaptation.
• Do not eliminate visual adaptation if your multimodal dataset is small.
• If your multimodal data is not limited, VL-BART is a reasonable baseline for multimodal self-rationalization. Otherwise, multiple models should be compared.

Limitations
While we examine multiple methods that were available to us while conducting this research, it is inevitable that new multimodal models will be released, leaving open the question of whether those models are already superior for multimodal self-rationalization. For example, the recently proposed OFA model (Wang et al., 2022a) could be particularly suitable. It is available in a range of sizes (33M, 93M, 182M, 472M, 930M), trained with a filtered version of the large-scale text corpus PILE (140GB; Gao et al., 2021) as well as with a variety of multimodal datasets and objectives, including image captioning and grounded captioning.
Besides that, all of the models we examined have been trained on English data and on clean, high-quality images from similar data sources (MS COCO, Flickr). Inherent biases stemming from these data sources would need to be studied in future work toward scaling this work to multiple languages and other image sources (e.g., noisy, dense-context, or adversarial images). Our main measure of explanation quality is plausibility, which does not answer whether these plausible explanations are useful to actual stakeholders in real-world applications of VQA and NLI. Another limitation is our computational resources; with access to more compute, we would be able to experiment at a larger scale, such as T5-11B or beyond.