VLIS: Unimodal Language Models Guide Multimodal Language Generation

Multimodal language generation, which leverages the synergy of language and vision, is a rapidly expanding field. However, existing vision-language models face challenges in tasks that require complex linguistic understanding. To address this issue, we introduce Visual-Language models as Importance Sampling weights (VLIS), a novel framework that combines the visual conditioning capability of vision-language models with the language understanding of unimodal text-only language models without further training. It extracts pointwise mutual information of each image and text from a visual-language model and uses the value as an importance sampling weight to adjust the token likelihood from a text-only model. VLIS improves vision-language models on diverse tasks, including commonsense understanding (WHOOPS, OK-VQA, and ScienceQA) and complex text generation (Concadia, Image Paragraph Captioning, and ROCStories). Our results suggest that VLIS represents a promising new direction for multimodal language generation.


Introduction
Visual Language Models (VLMs) extend unimodal text-only language models by conditioning their outputs on image context. Recent VLMs (Li et al., 2022a, 2023b; Wang et al., 2022) can perform diverse multimodal tasks, from commonsense VQAs (Marino et al., 2019; Schwenk et al., 2022) to in-context learning (Alayrac et al., 2022; Awadalla et al., 2023; Huang et al., 2023), and instruction tuning with visual inputs (Liu et al., 2023; Li et al., …) lets them follow open-ended user prompts.

Still, VLMs face challenges in tasks that require complex linguistic understanding. Figure 1 illustrates two failure cases, observed with both a strong image captioning model (BLIP-2 (Li et al., 2023b)) and an instruction-tuned model (LLAVA (Liu et al., 2023)). Firstly, VLMs avoid specifying named entities. The upper examples of Figure 1 show the VLM failing to name a public figure (Diego Maradona) or a movie character (Don Corleone). The problem is not a lack of knowledge: after applying our zero-shot method (VLIS), the same VLM produces the names. We further investigate this phenomenon in the landmark recognition experiment in appendix A.1.
Secondly, VLMs rely on the image context even when they should not. The lower examples of the same figure show the VLM being misled by the image context into denying commonsense knowledge. The questions are not unanswerable: the text-only language model, without the image context, answers both correctly. We provide more samples of visual distraction in appendix A.2.
Hence, the linguistic capabilities of the VLMs are not optimal yet.
On the other hand, unimodal text-only language models (Brown et al., 2020; Touvron et al., 2023) show reliable linguistic understanding and are known for their knowledge (Petroni et al., 2019; Meng et al., 2022) and complex reasoning capabilities (Kojima et al., 2022; Qin et al., 2023). Hence, it becomes reasonable to delegate the burden of language modeling to the text-only models.

To this end, we propose Visual-Language models as Importance Sampling weights (VLIS) as a plug-and-play method to enhance the unreliable linguistic understanding of VLMs. When generating each text token, VLIS follows the token likelihood of the unimodal text-only language model. Furthermore, VLIS multiplies in importance sampling (Tokdar and Kass, 2010) weights derived from a VLM to provide visual alignment signals. To isolate the visual conditioning capability of the VLMs from their language modeling preference, we use the exponentiated pointwise mutual information (PMI) (Church and Hanks, 1990) of the image context and the current text token as the weights. As a result, VLIS can maintain the favorable language modeling capability of the text-only model and control the visual conditioning strength at the same time.

We evaluate VLIS on two VLM backbones to test whether VLIS is effective both when the language modeling capability of the VLM is weaker than that of the text-only model (BLIP-2 (Li et al., 2023b)) and when the VLM is expected to model language well owing to visual instruction tuning (LLAVA (Liu et al., 2023)). Our experiments consist of various tasks that require both reliable language modeling and strong visual conditioning, including weirdness identification (WHOOPS (Bitton-Guetta et al., 2023)), commonsense VQA (OK-VQA (Marino et al., 2019), ScienceQA (Lu et al., 2022a)), extended image captioning (Concadia (Kreiss et al., 2022) and Image Paragraph Captioning (Krause et al., 2017)), and open-ended generation (ROCStories (Mostafazadeh et al., 2016)). Compared to dataset-specific state-of-the-art baselines and the base VLMs, VLIS improves linguistic capabilities such as responsiveness to prompts while maintaining visual conditioning, according to a comprehensive set of evaluation metrics.

VLMs as Importance Sampling Weights
We propose Visual-Language models as Importance Sampling weights (VLIS) to harmonize the visual conditioning capability of the VLMs with the linguistic fluency of the text-only language models. We provide the intuition behind our approach in §2.1, describe our token-level visual alignment scores in §2.2, and combine these scores with the text-only model via importance sampling in §2.3.

Intuition
Many recent Visual Language Models (VLMs) (Li et al., 2023b; Alayrac et al., 2022; Liu et al., 2023) are built on top of text-only language models (Iyer et al., 2022; Hoffmann et al., 2022; Touvron et al., 2023). At each timestep t, the per-token likelihood of an autoregressive text-only language model is modeled as p_text(x_t | x_<t), where x denotes a text token. To build a VLM p_vl, one can finetune the text-only model on data S consisting of paired images c and text x, with maximum likelihood estimation as the objective.
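For concreteness, a standard form of this maximum-likelihood objective (written in our notation; this is a generic statement rather than a quotation of the original equation) is:

$$\mathcal{L}(\theta) \;=\; -\sum_{(c,\, x) \in S} \sum_{t} \log p_{vl}\!\left(x_t \mid c,\, x_{<t};\, \theta\right).$$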
However, this objective only maximizes the image-conditional likelihood p_vl(x_t | c); it may introduce unintended artifacts into the marginal likelihood p_vl(x_t), which does not depend on any particular image. For example, image captioning models are known to not only reflect but also amplify the social bias present in their training data (Hendricks et al., 2018), or to distort the original language model's commonsense knowledge, as described in §1.
We henceforth seek to extract the visual conditioning capability of the VLMs isolated from their dubious language modeling skills.

Extracting Visual Weights
Here, we aim to find a quantity that extracts the visual conditioning strength of a VLM stripped of its language modeling preference. We employ Pointwise Mutual Information (PMI) (Church and Hanks, 1990), which measures the association between two events (text and image in our setting). On each step, we compute the PMI between the image context c and the next token x_t given the previous text context x_<t:

$$\mathrm{PMI}(x_t;\, c \mid x_{<t}) \;=\; \log \frac{p(x_t,\, c \mid x_{<t})}{p(x_t \mid x_{<t})\, p(c \mid x_{<t})} \;=\; \log \frac{p_{vl}(x_t \mid c,\, x_{<t})}{\mathbb{E}_{c'}\!\left[\, p_{vl}(x_t \mid c',\, x_{<t}) \,\right]},$$

where the second equality rewrites the definition in a more tractable form. The numerator is the image-conditional likelihood of the VLM and is easily obtainable. However, the denominator requires marginalization over the image context c. We enumerate three proxies below that bypass the excessive computation required to obtain the expectation over all possible images.

Approximating the marginal. The first option is to train a separate text-only model on the VLM's training data S. Considering the massive scale of S, this requires a considerable additional training burden, and there is no guarantee that the newly trained model accurately estimates the marginal likelihood. The second option is to use the sample mean over a pre-selected image set as a proxy for the true expectation. Lastly, the score for only one or two images may suffice as the sample image set.

We use the last method, which has the least computational overhead. Here, the sample set is a tiny set of images with close to no visual information: in practice, we use two images, a black-filled image c_b and a white-filled image c_w.

This efficient alternative works reasonably well in practice and is used in all our experiments. As a result, VLIS runs three forward passes of the VLM (one for the conditional likelihood and two for the marginal likelihood) and a single pass of the text-only model at each step of the generation process. We explore more choices of the marginal image set in appendix C, which shows that our specific set of images provides a reasonable trade-off between generation quality and inference time.
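To make the computation concrete, below is a minimal sketch of how the exponentiated PMI weight could be computed with the two-image marginal approximation described above. The helper `vlm_next_token_logprobs` and the exact tensor interfaces are our own assumptions, not the released implementation.

```python
import math
import torch

def pmi_weights(vlm_next_token_logprobs, image, marginal_images, prefix_ids):
    """Exponentiated PMI between the image and every candidate next token.

    vlm_next_token_logprobs(image, prefix_ids) is assumed to return a
    [vocab_size] tensor of log p_vl(x_t | c, x_<t). The marginal
    p_vl(x_t | x_<t) is approximated by averaging over a tiny set of
    near-contentless images (e.g., a black-filled and a white-filled image).
    """
    # log p_vl(x_t | c, x_<t): one forward pass on the actual image context.
    cond_logp = vlm_next_token_logprobs(image, prefix_ids)

    # log p_vl(x_t | x_<t) ~= log of the mean over proxy images: one pass each.
    marg_logps = torch.stack(
        [vlm_next_token_logprobs(img, prefix_ids) for img in marginal_images]
    )
    marg_logp = torch.logsumexp(marg_logps, dim=0) - math.log(len(marginal_images))

    # exp(PMI) = p_vl(x_t | c, x_<t) / p_vl(x_t | x_<t).
    return torch.exp(cond_logp - marg_logp)
```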

Computing VLIS Scores
We start from the token likelihood of the text-only language model, p_text(x_t | x_<t). To control the degree of confidence in the text-only model's decisions, we introduce a language temperature τ that smooths or sharpens the text-only distribution:

$$\tilde{p}_\text{text}(x_t \mid x_{<t}) \;\propto\; p_\text{text}(x_t \mid x_{<t})^{1/\tau}.$$

Then, we multiply the exponentiated PMI introduced in §2.2 with this likelihood for better visual alignment. VLIS decides the next token x_t with the following score f(x_t):

$$f(x_t \mid x_{<t},\, c) \;=\; \tilde{p}_\text{text}(x_t \mid x_{<t}) \, \exp\!\big(\mathrm{PMI}(x_t;\, c \mid x_{<t})\big) \;=\; \tilde{p}_\text{text}(x_t \mid x_{<t}) \, \frac{p_{vl}(x_t \mid c,\, x_{<t})}{p_{vl}(x_t \mid x_{<t})}.$$

Written in this form, VLIS performs importance sampling of the smoothed text-only model likelihood. Importance sampling (Tokdar and Kass, 2010) is a Monte-Carlo method for estimating a quantity v(x) under a nominal distribution p(x) using samples from another distribution, the importance distribution q(x). Here, the estimated quantity is the smoothed text-only model likelihood, the nominal distribution is the VLM's image-conditional likelihood p_vl(x_t | c), and the importance distribution is the VLM's marginal p_vl(x_t).

Implementation-wise, we replace the expectation with a single sample (the currently generated text). Thus, VLIS effectively treats the current token candidate as a sample from the VLM's marginal p_vl(x_t) and reweighs its importance with the VLM's conditional p_vl(x_t | c).

Fluency masking. The log visual weight PMI(x_t; c | x_<t) is a log-likelihood ratio and is unbounded. Hence, extreme cases, such as tiny values of the marginal likelihood p_vl(x_t | x_<t), may overrule the language generation process of the text-only model and yield degenerate text. To prevent such degeneracy, we apply a fluency mask to the importance sampling score f(x_t | x_<t, c): only tokens whose text-only likelihood is larger than a threshold α are allowed to be selected. Omitting the dependency on the contexts x_<t and c for simplicity:

$$f_\text{masked}(x_t) \;=\; \begin{cases} f(x_t) & \text{if } p_\text{text}(x_t) > \alpha, \\ -\infty & \text{otherwise.} \end{cases}$$

Intuitively, this mask filters out any token candidate to which the text-only model assigns a next-token probability lower than α. We fix the fluency threshold to α = 0.001 in all experiments except for an alternative architecture (appendix E). Still, VLIS is not overly sensitive to the specific value of the fluency threshold; we conduct a hyperparameter search to verify this in appendix D.

The token that maximizes the final score f(x_t | c, x_<t) is greedily selected as the next token. When VLIS is combined with other decoding methods, such as beam search, this score substitutes for the original token likelihood as the per-token score.
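As a rough illustration of the full scoring step (temperature smoothing, PMI weighting, fluency masking, and greedy selection), consider the sketch below. It reuses the hypothetical `pmi_weights` helper from §2.2 and an assumed `text_lm_next_token_logprobs` wrapper around the text-only model; details such as the masking value are our choices, not necessarily those of the released code.

```python
import torch

def vlis_greedy_step(text_lm_next_token_logprobs, vlm_next_token_logprobs,
                     image, marginal_images, prefix_ids, tau=1.0, alpha=1e-3):
    """One greedy VLIS decoding step; returns the id of the selected token."""
    # Smoothed text-only likelihood: proportional to p_text^(1/tau), renormalized.
    text_logp = text_lm_next_token_logprobs(prefix_ids)            # [vocab]
    smoothed_logp = torch.log_softmax(text_logp / tau, dim=-1)

    # Importance weights exp(PMI) from the VLM (see the sketch in Section 2.2).
    weights = pmi_weights(vlm_next_token_logprobs, image,
                          marginal_images, prefix_ids)              # [vocab]

    # VLIS score in log space: importance-weighted text-only likelihood.
    score = smoothed_logp + torch.log(weights)

    # Fluency mask: discard candidates the text-only model finds implausible.
    fluent = text_logp.exp() > alpha
    score = score.masked_fill(~fluent, float("-inf"))

    return int(torch.argmax(score))
```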

Experiments: Describing Facts
We verify that VLIS can alleviate the factual inaccuracy concerns raised in Figure 1 on various multimodal benchmarks: weirdness identification (§3.1), commonsense understanding (§3.2), and scientific reasoning (§3.3). VLIS consistently outperforms the backbone VLM and shows factual correctness comparable to strong baselines.

Experimental setups. We explore two experimental setups. Our experiments on the WHOOPS dataset use LLAVA (Liu et al., 2023) and Lynx (Zeng et al., 2023) as the VLMs and Vicuna 7B (Chiang et al., 2023) as the text-only model. In the VQA experiments, we use BLIP-2 OPT 2.7B (Li et al., 2023b) and OPT-IML 1.3B (Iyer et al., 2022) as our backbones. Note that the choice of model pairs is intentional: we impose similar computational requirements on the VLM and the text-only model to limit the additional computational burden of VLIS. In both cases, we use the base VLM as a general baseline to evaluate the gain from VLIS. Also, to verify the contribution of the PMI weights, we implement Naïve Ensemble, which simply multiplies the token likelihoods of the VLM and the text-only model.

Evaluation metrics. We evaluate closed-ended questions with binary (WHOOPS) and multi-choice (ScienceQA) accuracy. The open-ended VQAs (OK-VQA and VQAv2) use the task-specific VQA metric (Antol et al., 2015).
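For reference, the VQA metric credits an answer according to how many of the ten human annotators gave the same answer. A simplified sketch is shown below; the official implementation additionally normalizes answers (articles, punctuation, number words) and averages over annotator subsets, which we omit here.

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy (Antol et al., 2015): min(#matching humans / 3, 1)."""
    matches = sum(answer == prediction for answer in human_answers)
    return min(matches / 3.0, 1.0)

# Example: if 4 of the 10 annotators answered "pug", the prediction "pug"
# receives full credit.
print(vqa_accuracy("pug", ["pug"] * 4 + ["dog"] * 6))  # 1.0
```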

WHOOPS
WHOOPS (Bitton-Guetta et al., 2023) is a visual commonsense benchmark that checks a VLM's capability to understand images that defy commonsense. We adopt identification of weird images, a subtask of the WHOOPS benchmark, which requires a model to discriminate potentially weird images.

Approach and Baselines. Following the original paper (Bitton-Guetta et al., 2023), we employ pipelining to turn the original binary classification problem into a description generation problem. Specifically, a model first generates an explanation-of-violation (EoV) description of the given two images, which is then processed by the off-the-shelf text-only classifier GPT-3 (Brown et al., 2020) to yield a binary decision on which image is weird. We use VLIS to generate such EoV descriptions, as sketched below. The pipelined baselines include EoV descriptions from the backbone VLM (LLAVA), conventional machine-generated captions, and ground-truth captions from the WHOOPS dataset. We also include pipeline-less BLIP-2 (both supervised and zero-shot) as a baseline. The prompt, shared by VLIS and the backbone VLM, is illustrated in appendix F.

[Figure 3 examples: VLIS vs. VLM answers to "What is unusual about this image?" for the Albert Einstein and rubber-duck images; see the Figure 3 caption below.]
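A compact view of this two-stage pipeline, with the prompt wording and the GPT-3 classifier call written as illustrative placeholders rather than the exact prompts of the benchmark (the actual prompts are listed in appendix F):

```python
def identify_weird_image(vlis_generate, gpt3_choose_weird, image_a, image_b):
    """WHOOPS weirdness identification: describe first, then classify.

    vlis_generate(image, prompt)      -> EoV description of one image (stage 1).
    gpt3_choose_weird(desc_a, desc_b) -> "A" or "B" from a text-only classifier (stage 2).
    Both callables are hypothetical wrappers around the actual models.
    """
    prompt = "What is unusual about this image?"  # illustrative prompt
    eov_a = vlis_generate(image_a, prompt)
    eov_b = vlis_generate(image_b, prompt)
    return gpt3_choose_weird(eov_a, eov_b)
```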
Results. Table 1 and Figure 3 present results with LLAVA (Liu et al., 2023), an instruction-tuned VLM. VLIS-generated weirdness explanations perform on par with the ground-truth captions, which are manually annotated to contain the details necessary to identify the strangeness. Also, as a zero-shot method, VLIS shows performance comparable to the supervised BLIP-2 baseline. Interestingly, LLAVA alone cannot outperform conventional captions, even with instruction tuning and prompting.

Commonsense Understanding
Unimodal language models embody commonsense knowledge (Petroni et al., 2019; Davison et al., 2019; Tamborrino et al., 2020). If VLIS inherits this commonsense understanding capability, it should outperform the base VLM on tasks requiring both commonsense and visual understanding. Here, we examine this possibility with the commonsense VQA benchmark OK-VQA (Marino et al., 2019). Further, VLIS is also shown to maintain visual specificity on VQAv2 (Goyal et al., 2017).

Results: commonsense knowledge. On the OK-VQA (Marino et al., 2019) results in Table 2, VLIS achieves a meaningful improvement over the backbone VLM (BLIP-2). Also, the text-only backbone (OPT-IML) and Naïve Ensemble perform substantially worse, showing that VLIS is not simply imitating the text-only model's outputs. Instead, VLIS adaptively fuses the commonsense understanding capability of the text-only model with the visual conditioning of the VLM.

Results: maintaining visual specificity. When VQAs do not require text-based reasoning, VLIS should focus on visual conditioning only. The rightmost column of Table 2 summarizes results on the VQAv2 (Goyal et al., 2017) dataset, a VQA dataset with its textual bias intentionally removed. As shown by the VQA score, VLIS (Ours) preserves the VQA capability of the backbone VLM (BLIP-2). Note that Naïve Ensemble falls behind the text-only backbone (OPT-IML), offering a poor trade-off between visual and linguistic understanding.

Scientific Reasoning
ScienceQA (Lu et al., 2022a) evaluates multimodal science reasoning capability. Here, the goal of VLIS is to improve the answers in the presence of image contexts (IMG) and to preserve the answers of the text-only model in the absence of such visual context (TXT and NO).

Results. Table 3 demonstrates the findings on ScienceQA. On the IMG split, VLIS significantly improves over the text-only OPT-IML and Naïve Ensemble baselines. Also, VLIS maintains the performance of the text-only backbone on the TXT and NO splits. Finally, the base VLM (BLIP-2) falls behind by a wide margin, indicating that solid language understanding is necessary for scientific reasoning.

Experiments: Text Generation
In addition to factual knowledge, text-only language models manifest two critical capabilities: following prompt instructions and generating fluent and diverse text. We demonstrate that VLIS extends these qualities to the visual domain with contextualized captioning (§4.1), paragraph captioning (§4.2), and visual story generation (§4.3).

Contextualized Captioning
Concadia (Kreiss et al., 2022) is an image captioning dataset with the additional context of a paragraph from a Wikipedia article. The dataset provides two types of annotations: caption, which takes the article into account, and description, which ignores the article context.

Approach and Baselines. Following the original evaluation scheme (Kreiss et al., 2022), we generate a single text to compare against both the ground-truth caption and description. We include both supervised (Kreiss et al., 2022) and zero-shot (Socratic Models (Zeng et al., 2022)) baselines.

Results. In Table 4, VLIS outperforms the Socratic Models (Zeng et al., 2022) implementation, which is based on a stronger language model (GPT-3 175B (Brown et al., 2020)). Interestingly, the base VLM (BLIP-2) and VLIS (Ours) show completely different text styles. VLIS captions are better aligned with the caption-style annotations, showing that our method reflects the Wikipedia article better than the baselines. On the other hand, the VLM is better at generating description-style texts. Still, VLIS captions are closer to the visually intensive description annotations than those of all other baselines except the VLM.

Paragraph Captioning
Image Paragraph Captioning (Krause et al., 2017) has paragraph-long captions that describe the image in finer detail than sentence-level captions.
Approach and baselines. We saw that neither the VLM nor the text-only model could follow the paragraph-long format on its own; hence, we prime the models with three ground-truth examples (see appendix B for details).

Results. As visible in Table 5, VLIS greatly improves over the base VLM (BLIP-2), generating paragraph captions comparable to the supervised baselines. We provide an interpretation of this improvement with qualitative samples in appendix G: VLIS shows less text degeneracy than the base VLM while keeping visual hallucination to a minimum, unlike Naïve Ensemble.

Story Generation
Story generation is an open-ended generation task.To excel at it, VLIS should generate open-ended text without falling into text degeneracy, all the while staying close to the image context.
Approach and baselines. Unlike the previous experiments, here we use a supervised text-only model (Su et al., 2022b) finetuned on the text-only ROCStories (Mostafazadeh et al., 2016) dataset. Hence, we can safely assume that this specialist text-only model knows the language "better" than the VLM for story generation. We include both visually-conditioned (MAGIC (Su et al., 2022a)) and text-only (contrastive search (Su and Collier, 2023)) baselines. Refer to appendix B for more details on the baseline results.

Results. Table 6 presents the results of open-ended story generation. VLIS outperforms both Contrastive Search and MAGIC in all metrics. While Naïve Ensemble builds more diverse text (rep-2 and div.), its severely low coherence score suggests that its stories are less consistent, as represented in the qualitative samples of appendix G. Finally, while the base VLM (BLIP-2) shows high image-text correspondence, as reflected in its high CLIPScore, it cannot generate an articulate story, as its low performance on the other scores shows.

Qualitative Results
Commonsense Understanding. Figure 4 illustrates zero-shot results on the OK-VQA dataset (Marino et al., 2019). In (a) and (b), the baselines, including the base VLM and Naïve Ensemble, fail to understand the intention of the question (kind of dog and native to north america).

While the text-only model understands the question better and suggests plausible answer candidates (pug and wolf), it has no access to the visual inputs and ultimately outputs an incorrect answer. On the other hand, VLIS sensibly combines commonsense reasoning and visual context.

Results for images (c) and (d) depict failure cases. In (c), VLIS follows the reasoning process of the text-only language model and deduces that the answer should be a type of material. However, as the VLM focuses on the frontal object (umbrella), VLIS wrongly concludes that the answer is the material of that object (paper, which is coincidentally a flammable material as well). In (d), the text-only model produces an incoherent output (ocean).

VLIS inherits this misinterpretation and likewise generates an incorrect answer (water). In conclusion, VLIS induces coordination between the VLM's visual specificity and the text-only model's commonsense understanding, but it carries over the modeling insufficiencies of the individual models.

Open-Ended Generation. Lastly, we demonstrate the open-ended generation capability of VLIS in Figure 5. Here, VLIS should condition its output on both the diverse text prompts and the image. Unlike the base VLM, it adheres more closely to the prompt and produces a realistic self-introduction (hey, it's me), a personal journal entry (today I went), and romantic messages (here is a romantic message. answer:). Also, VLIS makes a pun on the word apple (seeing the apple laptop in the image and writing apple of my eye). Refer to appendix G for more baseline samples.

Related Work
Combining VLMs with text-only LMs. Early large-scale VLMs (LXMERT (Tan and Bansal, 2019), VisualBERT (Li et al., 2019), and ViLBERT (Lu et al., 2019)) saw the benefits of text-only pretraining by initializing their text encoders with a masked language model, BERT (Kenton and Toutanova, 2019). Later, Frozen (Tsimpoukelli et al., 2021) started a trend of freezing the language model and learning only the vision-language relationship. More recent models such as Flamingo (Alayrac et al., 2022) and BLIP-2 (Li et al., 2023b) also freeze the image encoder. ESPER (Yu et al., 2022) uses reinforcement learning to combine image encoders with language models.

Language Model Decoding. Language model decoding is the process of generating text from a pretrained language model. Traditional decoding methods use greedy decoding and beam search to find the most likely sequence of words. Truncated sampling algorithms such as top-k sampling (Fan et al., 2018; Holtzman et al., 2018; Radford et al., 2019), nucleus sampling (Holtzman et al., 2020), and typical-p sampling (Meister et al., 2022) have been proposed to avoid text degeneracy. Recent deterministic algorithms, such as contrastive decoding (Li et al., 2022b) and contrastive search (Su et al., 2022b; Su and Collier, 2023), provide a better trade-off between text fluency and model likelihood. Neurologic (Lu et al., 2021) and Neurologic A*esque decoding (Lu et al., 2022b) control language models to include given words in their outputs. As shown in our experiments, VLIS can be used jointly with any decoding method, including beam search and contrastive search.

Conclusion
We propose VLIS, a novel framework to alleviate the language modeling burden of visual-language models (VLMs). VLIS combines the linguistic understanding capability of text-only language models with the visual conditioning strength of VLMs via importance sampling. To isolate the VLMs' visual conditioning power, VLIS uses pointwise mutual information to suppress their text-only marginal distribution. Our framework enhances the base VLM on commonsense reasoning (WHOOPS (Bitton-Guetta et al., 2023), OK-VQA (Marino et al., 2019), and ScienceQA (Lu et al., 2022a)) and complex text generation (Concadia (Kreiss et al., 2022), Image Paragraph Captioning (Krause et al., 2017), and ROCStories (Mostafazadeh et al., 2016)) problems. In the future, VLIS can be extended to incorporate other modalities for which paired multimodal data is even scarcer. We hope that VLIS sparks interest in better utilization of off-the-shelf multimodal pretrained models.

Ethical Considerations & Limitations
Potential ethical concerns. As an inference-time method, VLIS inherits some known problems of both the VLMs and the unimodal text-only language models:

• Hallucination: VLMs are known to hallucinate information absent from the image (Rohrbach et al., 2018). While VLIS may strengthen visual conditioning and thereby help reduce the rate of visual hallucination, completely eradicating it is beyond the scope of this research.

• Social bias: It is widely known that VLMs reflect or even amplify (Hendricks et al., 2018; Hirota et al., 2022) social bias (e.g., gender or race) present in the training data. We have yet to determine how VLIS affects social bias in the base models. Thus, outputs generated using VLIS may contain social bias.

It is a meaningful direction to combine VLIS with reinforcement learning (Ramamurthy et al., 2023; Yu et al., 2023) or reward-based decoding algorithms (Su et al., 2022a) to alleviate the problems above, but we leave that to future research.
Limitations of VLIS and future work. Firstly, we acknowledge that this paper only explores a small fraction of the possible combinations of text-only models and VLMs. A large-scale search in this regard would reveal 1) better-performing pairs of text-only LMs and VLMs and 2) the characteristics required of a good model pair.

Secondly, VLIS could be extended to more modalities than the image-to-text generation problem covered here. Other modalities, such as audio and documents, may also benefit from applying VLIS to their modality-specific foundation models.

Finally, VLIS can be short-sighted. The method combines the outputs of the VLMs and the text-only models at the very last stage, the token likelihood. As a result, the VLIS score might be misleading when both models assign high probabilities to the same token for different reasons (e.g., homophones). It may help to estimate scores for the future generated text by rolling out a few generation steps and aggregating the outputs (Lu et al., 2022b), which we leave to future work.

A VLM Failure Cases

A.1 Landmark Recognition Experiment
To better understand the named entity recognition problem in VLMs' image descriptions, we check whether their descriptions of pictures of popular landmarks contain the proper names. We first collect the names of the 100 most popular landmarks. Then, we filter the list by removing landmarks whose names contain no proper nouns (e.g., Middle of the Earth), keeping 80 landmarks in total. Finally, we download the corresponding pictures from Wikipedia. Given the prompt What is this?, we task the VLM with generating a response of up to 100 tokens and check whether the output contains the name of the given landmark. Note that some landmarks have alternative names; hence, we collect alternative names from Wikipedia and count a model-generated answer as correct when it contains any of the accepted names. Finally, we check whether the model attempted an answer by inspecting whether the model-generated text contains the name of any landmark in our list. We calculate the precision score by dividing the number of correct predictions by the number of attempts.
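A minimal sketch of the matching and scoring protocol described above; the alias matching is simplified to case-insensitive substring search, and the variable names are our own placeholders.

```python
def landmark_scores(outputs, landmark_aliases):
    """Accuracy and precision for the landmark recognition probe.

    outputs:          {landmark: generated description}
    landmark_aliases: {landmark: [accepted names, including alternatives]}
    """
    all_names = [name for names in landmark_aliases.values() for name in names]
    correct = attempts = 0
    for landmark, text in outputs.items():
        lowered = text.lower()
        if any(alias.lower() in lowered for alias in landmark_aliases[landmark]):
            correct += 1
        # An attempt means the output names any landmark from our list.
        if any(name.lower() in lowered for name in all_names):
            attempts += 1
    accuracy = correct / len(outputs)
    precision = correct / attempts if attempts else 0.0
    return accuracy, precision
```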
Our landmark dataset is tiny compared to similar datasets (Weyand et al., 2020) on purpose: we want to check whether the VLM avoids telling the named entities, not whether the VLM saw them during training. Hence, we narrow the scope of evaluation to the most popular landmarks, for which we can assume that most of the entity names appear in the VLM training data.

Table 7 and Figure 6 compare base LLAVA (Liu et al., 2023) and VLIS on our landmark recognition dataset. The results show that the VLM (LLAVA) knows at least about half of the landmarks' names but does not produce them without applying VLIS. Also, VLIS shows good precision, indicating that it does not obtain more correct answers simply by guessing more. We further demonstrate that a proper answer to our prompt What is this? should contain the name of the landmark: when we present GPT-3 with the ground-truth alt captions and the prompt, GPT-3 always includes the landmark name in its output.

A.2 More Qualitative Results
Figure 7 shows the full raw text outputs for the VLM failure cases shown in Figure 1. Figure 8 shows more samples for the second failure case: the base VLM (LLAVA) is distracted by misleading visuals, while VLIS is not.

[Figure examples (landmark recognition): VLIS (Ours) vs. LLAVA descriptions of images of Marina Bay Sands and the Sagrada Familia.]

B Implementation Details
Computational Requirements. Using the LLM.int8 approximation (Dettmers et al., 2022), a single NVIDIA TITAN RTX GPU (24GB memory) fits both the BLIP-2 2.7B and OPT 1.3B models. The Flan-T5 XL and XXL models need more memory, and VLIS with the larger backbones requires an NVIDIA A6000 GPU (48GB) for inference. Both LLAVA 13B and Vicuna 7B fit on an A6000 GPU at the same time. Generating 50 tokens takes ∼20 seconds in all settings.

Hyperparameters. We fix the fluency threshold α = 0.001 in all experiments and use beam search with beam size 5. For QA problems, we apply a length penalty < 0 to the beam score to induce succinct answers, following the literature (Li et al., 2023b). The opposite behavior is required for longer text generation, so we set the value above 0 for open-ended generation problems. The language temperature τ is manually selected by examining the text quality of three samples per task.

Task-Specific Hyperparameters. For the VQAv2 (Goyal et al., 2017), OK-VQA (Marino et al., 2019), and ScienceQA (Lu et al., 2022a) datasets, we set the language temperature τ = 1.25 and the length penalty to −1.0 to induce succinct answers generated with stronger visual conditioning. For Concadia (Kreiss et al., 2022), τ = 0.67 and a length penalty of −2.0 are used for succinct caption-style text with better text conditioning. For the Image Paragraph Captioning (Krause et al., 2017) experiments, we use τ = 0.67 and a length penalty of 1 to induce longer captions. Also, we apply contrastive search (Su and Collier, 2023) with a penalty of 0.6 to avoid text degeneracy.
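For convenience, the settings above can be summarized in a single configuration; the dictionary layout below is only illustrative, but the values are the ones reported in this section.

```python
# Language temperature tau and beam-search length penalty per task (beam size 5).
TASK_HPARAMS = {
    "vqav2":      {"tau": 1.25, "length_penalty": -1.0},
    "ok_vqa":     {"tau": 1.25, "length_penalty": -1.0},
    "scienceqa":  {"tau": 1.25, "length_penalty": -1.0},
    "concadia":   {"tau": 0.67, "length_penalty": -2.0},
    "paragraph_captioning": {"tau": 0.67, "length_penalty": 1.0,
                             "contrastive_search_penalty": 0.6},
}
FLUENCY_ALPHA = 1e-3  # relaxed to 1e-4 for the Flan-T5 backbone (appendix E)
```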
Flan-T5 Hyperparameters. For the backbone comparison study in appendix E, we set the VLM backbone to BLIP-2 Flan-T5 (Li et al., 2023b) and the text-only model to Flan-T5 (Chung et al., 2022). For the Flan-T5 variants, we compensate for the overconfidence of the model with a large temperature of 1.5 to normalize the logit outputs. For the same reason, we also relax the fluency threshold to α = 0.0001. Finally, the language temperature τ is set to 0.9.

Baseline Hyperparameters. We share the same hyperparameters as in VLIS for all our implemented baselines: LLAVA, BLIP-2, OPT-IML, and Naïve Ensemble. We do not modify the beam size of 5 and the fluency threshold α = 0.001, and we change the length penalty according to the task, following the VLIS hyperparameters.

Few-Shot Settings. For Image Paragraph Captioning (Krause et al., 2017), we use three ground-truth examples to prime the models for the paragraph-long generation task. However, one cannot provide multiple images as inputs to the backbone VLM (BLIP-2 (Li et al., 2023b)). Hence, we simply insert the few-shot samples in the text domain and provide only the single target image as the visual context.
int8 Inference. LLM.int8 (Dettmers et al., 2022) is an approximate inference technique for large language models. It applies vector-wise quantization and mixed-precision decomposition to reduce memory consumption without performance degradation. We employ the technique to jointly run both the text-only LM and the VLM on a single GPU.
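For instance, with the Hugging Face transformers / bitsandbytes integration, 8-bit loading looks roughly like the snippet below; the model identifiers are examples, and our setup may differ in detail.

```python
from transformers import AutoModelForCausalLM, Blip2ForConditionalGeneration

# Load both the text-only LM and the VLM in 8-bit (LLM.int8) so that they
# fit on a single GPU together.
text_lm = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-iml-1.3b", load_in_8bit=True, device_map="auto")
vlm = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", load_in_8bit=True, device_map="auto")
```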
Randomness. As VLIS is a deterministic inference-time algorithm, no randomness is involved in any of the experiments. A stochastic sampling version of VLIS may require variance analysis, but we leave that to future research.

Evaluating Story Generation. While the official repository of MAGIC (Su et al., 2022a) shares its inference results, it does not contain the evaluation scripts. Thus, we consult the Contrastive Decoding (Li et al., 2022b) repository for the evaluation script for the open-ended generation problem. Due to the difference in evaluation code, our baseline scores differ from the results reported in MAGIC (Su et al., 2022a). However, we still use the public inference results for the baselines and evaluate each model with publicly available code, making our evaluation pipeline unbiased, transparent, and reproducible.

C Marginal Approximation Experiment
In the main paper, we propose using one or two images with minimal visual information (black-filled and white-filled) as a functional candidate with minimal computational overhead. To investigate alternative approaches, we conduct an additional experiment on the OK-VQA dataset. The variables considered here are 1) a random vs. a predefined (black-filled and white-filled) set of images and 2) the number of images used to approximate the expectation. We keep everything else the same as in Table 2 and only adjust the marginal approximation scheme.

Our results are summarized in Table 8. First, a random set of images is inferior to our predefined set of images for approximating the marginal. Second, a set of 10 random images offers a better approximation than the predefined set of two images. Still, the 10-random-image option requires 11 passes of the VLM per generated token, making it largely inefficient for practical use.

D Fluency Threshold Experiment
Here, we examine the effect of the fluency threshold value α on the generation quality of VLIS. This experiment extends the OK-VQA commonsense reasoning experiment in Table 2 and keeps all other variables the same except for α.

Table 9 shows that VLIS consistently outperforms the VLM-only baseline for all values of α in the range [1e−3, 1e−5]. Larger values ([1e−1, 1e−2]) still harm performance, as they typically leave only one or two token candidates for the VLIS score to choose from.

E Backbone Scale Experiment
We conduct a comparison study to test whether the improvement offered by VLIS is generalizable to a wider set of architectures and model sizes.
Here, we mainly evaluate VLIS with Flan-T5 variants as both the text-only LM and the VLM backbone. T5 (Raffel et al., 2020) is an encoder-decoder transformer, unlike decoder-only autoregressive language models (e.g., OPT (Zhang et al., 2022) and GPT-3 (Brown et al., 2020)). Flan-T5 (Chung et al., 2022) further trains T5 for better responsiveness to instruction prompts. Table 10 summarizes the backbone comparison results on the OK-VQA dataset (Marino et al., 2019). In all combinations of model sizes except for Flan-T5 Base, VLIS improves the commonsense reasoning capability of the VLM backbone. Also, Naïve Ensemble performs unreliably depending on the choice of the text-only LM and performs worse than the VLM itself in most settings. The Flan-T5 Base LM makes VLIS perform worse than the VLM. Since VLIS is built on the assumption that the text-only LM knows the human language distribution better than the VLM, this deterioration further supports our explanation of why VLIS works.

F Prompt Templates
In the prompt templates below, TLM denotes the prompt presented to the text-only model and VLM denotes that given to the VLM.

Figure 1 :
Figure 1: Top: VLIS correctly recognizes named entities, unlike the VLMs. Bottom: VLIS is not deceived by the distractor images. Note that the images show a seagull and a monkey, not an ostrich and a chimpanzee. VLIS inherits this favorable linguistic capability from a text-only language model (Touvron et al., 2023; Zhang et al., 2022) and uses the VLMs as a guide for visual alignment. The examples are truncated for visualization purposes; we provide the full text in appendix A.2.

Figure 2 :
Figure 2: Comparison of VLIS and the standard VLM decoding process. Using the VLM, we first obtain the image-conditional likelihood p_vl(Answer | image) and the text-only likelihood p_vl(Answer) given an image and a prompt or question. Then, we compute the exponentiated pointwise mutual information (PMI) from these likelihoods. Finally, the exponentiated PMI score is used as the importance weight for the text-only model likelihood p_text(Answer).

Figure 3 :
Figure 3: Qualitative samples from the WHOOPS (Bitton-Guetta et al., 2023) experiments. As marked in green, specific descriptions are required to explain weirdness.

Figure 4 :
Figure 4: Generation results on the OK-VQA dataset (Marino et al., 2019). We color the intention of the question in green and answers that defy that intention in red. (c) and (d) are failure cases.

Figure 5: Open-ended generation results with BLIP-2 (Li et al., 2023b) as the base VLM. We use three text prompts (Hey, It's me, Today I went, and Here is a romantic message. Answer:) to test whether VLIS can actively adjust its response according to the text prompt while maintaining visual alignment.
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2020-0-01361), Institute for Project-Y, and NCSOFT Vision/NLP Center.

Figure 6 :
Figure 6: Comparison of LLAVA and VLIS in the landmark recognition experiment.
…caption that describes the image. Article: "[ARTICLE]" Caption:

…image is strange or natural in terms of physics, commonsense, or etc.\n Start with "The image shows"

VLM: Decide whether the image is strange or natural in terms of physics, commonsense, or etc.\n Start with "The image shows"

G More Qualitative Samples

We include more qualitative samples for image paragraph generation in Figure 9, story generation in Figure 10, and open-ended generation in Figure 11.

[Figure 7 content: raw GPT-3, BLIP-2, LLAVA, and VLIS (Ours) outputs for the Figure 1 examples (Diego Maradona, Michael Corleone and "Does he care for his family?", chimpanzee tails, and ostrich flight); see the Figure 7 caption below.]

Figure 7 :
Figure 7: Raw text output for samples described in Figure 1.

Figure 11 :
Figure 11: Open-ended generation results with various text prompts. Here we include more baselines than in Figure 5.

Table 3 :
Results on the ScienceQA test set (Lu et al., 2022a). IMG denotes the subset with image context, TXT the text-context subset, and NO the subset without any context.

Table 4 :
Results on the Concadia (Kreiss et al., 2022) test set. Cap denotes the caption annotations and Desc the description annotations. We report CIDEr following the literature.

Table 7 :
Results of our landmark recognition experiment. Acc denotes accuracy and Prec denotes precision.

Table 8 :
Results on the OK-VQA validation set. Our default option (predefined set with two images) is marked in bold.

Table 10 :
Backbone comparison experiments on the validation set of the OK-VQA dataset (Marino et al., 2019). F-T5 denotes T5 trained on the FLAN dataset (Wei et al., 2022).