Z-LaVI: Zero-Shot Language Solver Fueled by Visual Imagination

Large-scale pretrained language models have made significant advances in solving downstream language understanding tasks. However, they generally suffer from reporting bias, the phenomenon describing the lack of explicit commonsense knowledge in written text, e.g., "an orange is orange". To overcome this limitation, we develop a novel approach, Z-LaVI, to endow language models with visual imagination capabilities. Specifically, we leverage two complementary types of "imaginations": (i) recalling existing images through retrieval and (ii) synthesizing nonexistent images via text-to-image generation. Jointly exploiting the language inputs and the imagination, a pretrained vision-language model (e.g., CLIP) eventually composes a zero-shot solution to the original language tasks. Notably, fueling language models with imagination can effectively leverage visual knowledge to solve plain language tasks. As a consequence, Z-LaVI consistently improves the zero-shot performance of existing language models across a diverse set of language tasks.


Introduction
Large-scale Pretrained Language Models (PLMs) have achieved great success on various Natural Language Understanding (NLU) tasks and even exhibit impressive zero-shot capabilities without task-specific fine-tuning (Radford et al., 2019). Recent research suggests that such ability improves by further scaling up the model size (e.g., to hundreds of billions of parameters) and the amount of textual pretraining data (e.g., to terabytes of raw text) (Min et al., 2021; Brown et al., 2020; Chowdhery et al., 2022; Kaplan et al., 2020). However, zero-shot language learners solely trained on text inevitably suffer from human reporting bias. For example, people tend not to write common or apparent things (Grice, 1975), and the frequency of a certain textual statement does not always correspond to its relative likelihood in the world (Gordon and Van Durme, 2013). Therefore, looking into other modalities to supplement the textual information is crucial.

* Work was done during an internship at Tencent AI Lab.

Figure 1: Our system endows language models with two complementary types of visual imagination capabilities: recalling existing images (through retrieval) and synthesizing nonexistent images (via text-to-image generation). They effectively alleviate the reporting bias issue and improve the zero-shot performance for solving plain language tasks. We experiment with three types of tasks: (a) Word Sense Disambiguation, (b) Science Question Answering, and (c) Topic Classification.
In this paper, we focus on incorporating visual knowledge to facilitate the solution of plain language understanding tasks. Cognitive science has demonstrated that the human vision system is crucial to supplement, interact with, and influence the language system (Dessalegn and Landau, 2013). For example, there exists a fast mapping between vision and language in the human language learning process (Altmann and Kamide, 2004). Inspired by this, we propose a visual imagination framework, Z-LaVI, to endow any PLM (e.g., GPT, BERT, BART, etc.) with visual imagination capabilities.
Specifically, we apply two different types of "visual imaginations" to the input texts. Given an input text, the first approach recalls existing images (e.g., through search engines), and the second one synthesizes nonexistent images via text-to-image generation models (e.g., DALL•E (Ramesh et al., 2021)). These two strategies mimic different types of human mental behaviors, i.e., recalling past memories and creative mental image construction. Interestingly, we find that these two mechanisms are highly complementary. Our proposed visual imagination module tends to rely more on recalling when input texts are short, because their corresponding objects or scenes generally exist and are easy to find. However, when input texts are long and complex, the module is more inclined to create new images. We develop a unified framework (Figure 1) that exploits both types of imaginations along with the original textual inputs to compose zero-shot solutions to a broad set of downstream language tasks. Note that our work differs from existing multi-modal tasks such as VQA (Antol et al., 2015; Wu et al., 2017) or Visual Dialog (Das et al., 2017), which have both textual and visual inputs. Instead, we use visual imagination as machinery to facilitate the (zero-shot) solution of pure language tasks.
We show that on a diverse set of language understanding tasks, Z-LaVI consistently improves the performance of existing language models of different sizes and architectures. In particular, Z-LaVI with SBERT achieves a zero-shot F1 score of 87.5% on the WSD task without fine-tuning, outperforming by 2.3% even a BERT-large model fine-tuned with three examples per sense. Z-LaVI also beats all existing zero-shot models on four Science QA tasks and two Topic Classification tasks by a large margin. Our analysis demonstrates that Z-LaVI complements language models and significantly alleviates PLMs' zero-shot prediction errors by adaptively executing two visual imagination mechanisms, RECALL and SYNTHESIS.

Task Formulation
To provide a zero-shot solution for language tasks and solve them in a uniform way, we transform different tasks into multiple-choice questions, where an input stream x and a stream of candidate answers y ∈ Y are provided. The goal is to select the correct answer from Y. In particular, for word sense disambiguation tasks, x is the instance sentence and Y contains all possible senses of the target word; for science question answering tasks, x is the question and Y contains the answer options; for text classification tasks, x is the input sentence and Y is the pool of categories. To make a prediction, the model needs to estimate the plausibility of each tuple (x, y) for all y ∈ Y and select the best answer ŷ:

$$\hat{y} = \arg\max_{y \in \mathcal{Y}} P(y \mid x). \quad (1)$$

Language Models for Zero-shot Tasks
We consider three main approaches for employing language models to make zero-shot predictions on language tasks.

Prompt-based Approach (Petroni et al., 2019; Schick and Schütze, 2021) treats Natural Language Understanding tasks as a cloze test using prompts.
For example, we can format question-answering tasks into a prompt that places the question before the answer. We convert the input (x, y) into a sequence of tokens $\mathcal{W} = (w_1, ..., w_t, ..., w_{t+k}, ..., w_{|\mathcal{W}|})$ via a prompt, in which $y = (w_t, ..., w_{t+k})$.² We apply autoregressive language models such as GPT (Brown et al., 2020) to calculate the score

$$\mathrm{Score}_{La}(x, y) = \frac{1}{k+1} \sum_{i=t}^{t+k} \log P_{La}(w_i \mid w_{<i}),$$

where $P_{La}(\cdot)$ denotes the probability given by the language model. Note that we adopt the standard token-length normalization to handle different lengths of answer choices. Finally, we apply a softmax over $\mathrm{Score}_{La}(x, y)$ to obtain the probability of each candidate:

$$p_{La}(y \mid x) = \frac{e^{\mathrm{Score}_{La}(x, y)}}{\sum_{y' \in \mathcal{Y}} e^{\mathrm{Score}_{La}(x, y')}}. \quad (2)$$

For the prompt-based approach, we select GPT-Neo-1.3B/2.7B (Black et al., 2021), GPT-J-6B (Wang and Komatsuzaki, 2021), and OPT-30B (Zhang et al., 2022c) as our models. GPT-Neo and GPT-J are trained on the Pile dataset (Gao et al., 2020), which contains 825 GB of English text data. Besides the Pile, OPT concatenates the training data of RoBERTa (Liu et al., 2019) and PushShift Reddit (Baumgartner et al., 2020).
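For illustration, the following is a minimal sketch of prompt-based zero-shot scoring with token-length normalization, assuming a GPT-style causal language model from Hugging Face transformers; the model name, prompt template, and example question are illustrative, not our exact setup:

```python
# A minimal sketch of prompt-based zero-shot scoring with token-length
# normalization (Score_La above); model name and prompt are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
model.eval()

def score_la(prompt: str, answer: str) -> float:
    """Average log-probability of the answer tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(" " + answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Position i predicts token i+1, so drop the last logit to align.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    k = answer_ids.shape[1]
    token_lps = log_probs[-k:].gather(1, answer_ids[0].unsqueeze(1))
    return token_lps.mean().item()  # token-length normalization

question = "Question: What do mollusks use for protection? Answer:"
options = ["shells", "wings", "fur"]
scores = torch.tensor([score_la(question, o) for o in options])
print(dict(zip(options, scores.softmax(dim=0).tolist())))  # equation (2)
```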
Natural Language Inference (NLI) Approach (Yin et al., 2019) casts zero-shot text classification as textual entailment. The NLI approach treats the input pair (x, y) as a (premise, hypothesis) pair and predicts the probability that the premise logically entails the hypothesis.
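A minimal sketch of this approach using the Hugging Face zero-shot-classification pipeline (the same pipeline noted in our implementation details); the model name and example inputs are illustrative:

```python
# A minimal sketch of the NLI-based approach via the Hugging Face
# zero-shot-classification pipeline; model name and inputs are illustrative.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")
result = classifier(
    "Floods displaced thousands of residents along the coast.",  # premise x
    candidate_labels=["need shelter", "need water", "crime violence"],  # y
)
print(result["labels"][0], result["scores"][0])  # top predicted label
```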
² We include the prompts of all tasks in Table 9.
Latent Embedding Approach utilizes an off-the-shelf feature encoder $f_\theta$ to project the input tuple (x, y) into a shared latent space and determines their relevance with a cosine similarity score:

$$\mathrm{Score}_{La}(x, y) = \cos\big(f_\theta(x), f_\theta(y)\big).$$

Relevance scores are normalized with a softmax (equation 2) to obtain the final probabilities.
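A minimal sketch of this approach with SBERT, assuming the sentence-transformers package; the model name and inputs are illustrative:

```python
# A minimal sketch of the latent embedding approach with SBERT,
# assuming sentence-transformers; model name and inputs are illustrative.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")
x = "He deposited the check at the bank on Friday."
candidates = ["a financial institution", "the land alongside a river"]

x_emb = encoder.encode(x, convert_to_tensor=True)
y_emb = encoder.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(x_emb, y_emb)[0]  # cosine relevance per candidate
probs = scores.softmax(dim=0)           # normalize as in equation (2)
print(candidates[int(probs.argmax())])
```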

Language with Visual Imagination
Visual Imagination aims to convert either x or y (depending on the task) in the textual input tuple (x, y) into an image. For WSD and QA tasks, we imagine the candidate options y, while for topic classification tasks, we imagine the instance sentence x. Here we illustrate our method through the example of imagining y. We propose two imagination mechanisms: 1) RECALL and 2) SYNTHESIS.
1) RECALL: We use the text input to query Bing Image Search to recall corresponding images. We set a maximum number of images for each query; when only a limited number of images are available for a query, we download all of them.
2) SYNTHESIS: We adopt DALL•E (Ramesh et al., 2021), a text-to-image generation model pretrained on image-caption pairs, to synthesize images. DALL•E constructs a codebook $\mathcal{V}$ using a discrete variational autoencoder (dVAE) (Rolfe, 2016) to map an image into tokens, which are concatenated with the caption's text tokens. DALL•E then models the joint distribution over the text and image tokens with an autoregressive transformer. During inference, DALL•E feeds the text tokens y into the transformer and generates a sequence of image tokens $(v_1, v_2, ..., v_m)$, where each image token $v_i$ is predicted based on the previous ones:

$$v_i \sim P(v_i \mid y, v_{<i}), \quad v_i \in \mathcal{V},$$

in which $\mathcal{V}$ is the visual codebook. After we generate enough image tokens, we decode the tokens into images by looking up the vectors in the dVAE codebook to construct the pixels. We iterate the SYNTHESIS process multiple times and combine the results with the images from RECALL to collect a set of K images $\{I_y^k \mid k = 1, ..., K\}$ for each textual input y. A sketch of this decoding loop follows.
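The following is a schematic of the autoregressive decoding described above; `transformer` and `dvae_decoder` are hypothetical stand-ins for DALL•E's components, not the released API:

```python
# A schematic of autoregressive image-token SYNTHESIS. `transformer` and
# `dvae_decoder` are hypothetical stand-ins for DALL-E's components.
import torch

def synthesize(text_tokens: torch.Tensor, transformer, dvae_decoder,
               num_image_tokens: int = 256) -> torch.Tensor:
    tokens = text_tokens                     # start from the caption's text tokens
    image_tokens = []
    for _ in range(num_image_tokens):
        logits = transformer(tokens)[:, -1]  # distribution over the codebook V
        v_i = torch.multinomial(logits.softmax(dim=-1), 1)  # v_i ~ P(v_i | y, v_<i)
        image_tokens.append(v_i)
        tokens = torch.cat([tokens, v_i], dim=1)
    # Look up the dVAE codebook vectors for the tokens and decode into pixels.
    return dvae_decoder(torch.cat(image_tokens, dim=1))
```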
Vision-Text Model for Zero-shot Language Tasks. After transferring an input stream into images, we turn a plain language task into a multimodal task, so that vision-text models can solve it. We choose CLIP (Radford et al., 2021) as our vision-text model, which is pretrained on 400M image-caption pairs with a contrastive learning strategy. CLIP has a text encoder $f_T$ and a visual encoder $f_V$, which project text and images into a shared latent space. Similar to the latent embedding approach described in Section 2.2, we aggregate the K images collected previously and use CLIP to compute the relevance score of (x, y):

$$\mathrm{Score}_{VI}(x, y) = \frac{1}{K} \sum_{k=1}^{K} \cos\big(f_T(x), f_V(I_y^k)\big),$$

and we obtain a probability distribution through a softmax over y:

$$p_{VI}(y \mid x) = \frac{e^{\mathrm{Score}_{VI}(x, y)}}{\sum_{y' \in \mathcal{Y}} e^{\mathrm{Score}_{VI}(x, y')}}. \quad (6)$$

Ensemble Language and Vision Prediction. Our system is designed for zero-shot tasks, so there is no labeled data to learn weights for ensembling the two models. Therefore, we adopt a weighted sum as a late fusion over the final output distributions of the language and vision-text models:

$$p(y \mid x) = (1 - w) \cdot p_{La}(y \mid x) + w \cdot p_{VI}(y \mid x), \quad (8)$$

where we design a heuristic function to calibrate the weight w based on the relative size of the vision-text model and the language model:

$$w = \frac{P_{VI}}{P_{VI} + P_{La}}, \quad (9)$$

where $P_{VI}$ and $P_{La}$ are the numbers of parameters of the vision-text and language models, respectively. We hypothesize that as a language model's size increases, it encodes more knowledge and thus relies less on the vision model.
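Putting the pieces together, a minimal sketch of the CLIP scoring and the late-fusion ensemble, assuming OpenAI's clip package; the images per option and parameter counts are supplied by the caller:

```python
# A minimal sketch of CLIP-based scoring (equation 6) and the late-fusion
# ensemble (equations 8-9), assuming OpenAI's clip package.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def vision_probs(x: str, images_per_option: list) -> torch.Tensor:
    """Mean CLIP similarity between text x and K images for each option y."""
    with torch.no_grad():
        t = model.encode_text(clip.tokenize([x]).to(device))
        t = t / t.norm(dim=-1, keepdim=True)
        scores = []
        for imgs in images_per_option:        # one list of PIL images per y
            batch = torch.stack([preprocess(im) for im in imgs]).to(device)
            v = model.encode_image(batch)
            v = v / v.norm(dim=-1, keepdim=True)
            scores.append((t @ v.T).mean())   # average cosine over K images
    return torch.stack(scores).softmax(dim=0)  # equation (6)

def ensemble(p_la: torch.Tensor, p_vi: torch.Tensor,
             n_la: int, n_vi: int) -> torch.Tensor:
    w = n_vi / (n_vi + n_la)                  # equation (9): parameter counts
    return (1 - w) * p_la + w * p_vi          # equation (8): late fusion
```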
The number of parameters of each model and the corresponding ensemble weights are listed in Table 8.
Experimental Setup

Datasets
We evaluate our methods on seven datasets covering three tasks. Table 1 shows dataset statistics.
CoarseWSD-20 (Loureiro et al., 2021) is a coarse-grained WSD dataset built from Wikipedia. The dataset consists of 20 nouns with 2-5 senses per noun (53 senses in total). Each sense is associated with a definition, which is the first sentence of its Wikipedia page. CoarseWSD-20 guarantees that every sense has instances in the test set; on average, each sense has 192 test instances.
QASC (Khot et al., 2020) is a multi-hop, 8-way multiple-choice question answering dataset collected by decomposing sentences about scientific facts. We report performance on the development set, which contains 926 questions.
SciQ (Welbl et al., 2017) is a dataset of 4-way multiple-choice science exam questions, spanning elementary to college level and covering chemistry, biology, physics, etc. We evaluate on the development set of 1,000 questions.
ARC (Clark et al., 2018) consists of 7,787 natural, grade-school-level science questions. The dataset is split into an easy set (ARC-E) and a challenge set (ARC-C), where the challenge set contains questions that simple retrieval or word-correlation methods cannot answer correctly. We evaluate on the development sets of ARC-E and ARC-C, which contain 570 and 299 questions, respectively.
AG News (Zhang et al., 2015) is a news topic classification dataset in which each sentence is associated with one of four news types: world, sports, business, and technology. We run our models on the 7,600 examples in the test set.
Situation (Mayhew et al., 2018) is an event-type classification task. The dataset has 12 event types: need water, need infrastructure, crime violence, etc. The original task is multi-label classification and includes an out-of-domain class. Because multi-label prediction requires a fine-tuned threshold to determine the predictions, which is unsuitable for zero-shot models, we remove the examples with more than one label and those with the out-of-domain label, resulting in 1,789 instances.

Baselines
Aside from the zero-shot language models described in Section 2.2, we also include a random baseline and compare with previous work. For CoarseWSD-20, we compare with the BERT-large few-shot (1-shot/3-shot per sense) results reported in Loureiro et al. (2021).
For QA tasks, we include the Information Retrieval (IR) solver (Clark et al., 2016), which combines the question and each option into a query and sends it to a search engine to check whether they are explicitly written in some corpus. We also choose SMLM (Banerjee and Baral, 2020) as another baseline: a RoBERTa-large model fine-tuned on triplets extracted from knowledge graphs such as ATOMIC (Sap et al., 2019).
For topic classification, we compare with TE-wiki (Ding et al., 2022), the state-of-the-art zero-shot topic classification model, trained on a dataset collected from Wikipedia.

Evaluation Metrics
We report accuracy on all question-answering and topic-classification datasets. For CoarseWSD-20, we compute each word's accuracy and F1 score and take the mean over all 20 words.

Implementation Details
Image Collection. We adopt Bing Image Search to RECALL images. For image SYNTHESIS, we utilize the newly released DALL•E-mini, which uses VQGAN (Esser et al., 2021) as the image encoder/decoder and BART (Lewis et al., 2020) as the autoregressive transformer. For every textual input, we obtain 100 images from each of the two methods. The 200 images are sorted by CLIP according to their similarity to the text input. We preserve each text input's top-10 images (K = 10) and feed them into equation (6) to calculate the vision-text probabilities, as sketched below.
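A minimal sketch of this selection step, assuming L2-normalized CLIP features computed as in the earlier scoring sketch; shapes are illustrative:

```python
# A minimal sketch of top-K image selection, assuming L2-normalized CLIP
# features for the text input and the 200 candidate images.
import torch

def select_top_k(text_feat: torch.Tensor, image_feats: torch.Tensor,
                 k: int = 10) -> torch.Tensor:
    """text_feat: [d]; image_feats: [200, d] (100 RECALL + 100 SYNTHESIS)."""
    sims = image_feats @ text_feat     # cosine similarity to the text input
    return sims.topk(k).indices        # indices of the K most relevant images
```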

Model Implementation
The GPT-style and NLI-based language models are built on top of the Hugging Face API. For NLI models, we use the recently released zero-shot-classification pipeline. We use the official releases of SBERT and SimCSE to implement the latent embedding approach. The CLIP model is adapted from OpenAI's public repo, and we select ViT-B/32 as the image encoder. The experiments were run on 3 × 8 NVIDIA V100 32GB GPUs, which can generate 24 images in 5 seconds. The majority of our model's running time is image generation: in total, we employ DALL•E-mini to generate approximately 1.8M images, which takes around 104 hours.

Main Results
Z-LaVI boosts the performance of language models. Tables 2, 3, and 4 show results on seven datasets across three tasks. Each dataset has two results columns: the original performance of the language model and the ensemble performance after adding our Z-LaVI model. We observe that in most cases, Z-LaVI consistently improves the performance of different language models. In particular, on the WSD task, Z-LaVI with SBERT outperforms BERT-large fine-tuned with 3 shots per sense. Z-LaVI also significantly enhances language models on the topic classification tasks, where the best language model with Z-LaVI beats the SOTA zero-shot topic classification model TE-wiki by 2.8%. For science QA tasks, Z-LaVI improves on QASC, SciQ, and ARC-E, but it struggles on ARC-C, where adding Z-LaVI degrades the performance of a few language models. This is because ARC-C questions are designed to be hard to answer using retrieval or correlation, while Z-LaVI uses CLIP, which is pretrained on image-text correlation only.

Z-LaVI without a language model is a strong baseline. Surprisingly, we also find that Z-LaVI w/o language model performs well on plain language tasks. On some datasets, such as QASC, CoarseWSD-20, and the topic classification tasks, Z-LaVI w/o LM outperforms language models that are not fine-tuned on the downstream datasets (e.g., SimCSE, GPT-Neo-1.3B/2.7B). This indicates that pretraining the vision-text model on image-caption pairs yields knowledge that can be leveraged to solve single-modality tasks.
Ensembling two language models is not as good as Z-LaVI. To verify the effectiveness of using visual knowledge, we replace the visual imagination of Z-LaVI with another language model, SimCSE. We select SimCSE because it is trained fully unsupervised and has the same contrastive learning objective as CLIP. We define the performance gain (PG) of a model M₁ (i.e., SimCSE or Z-LaVI) on top of a model M₂ as the relative improvement of the ensemble model Ens(M₁, M₂) over the original model Orig(M₂). We include all the language models (excluding SimCSE) in a set $\mathcal{M}$ and calculate the average performance gain on a dataset as:

$$\overline{PG}(M_1) = \frac{1}{|\mathcal{M}|} \sum_{M_2 \in \mathcal{M}} \frac{\mathrm{Ens}(M_1, M_2) - \mathrm{Orig}(M_2)}{\mathrm{Orig}(M_2)}.$$

For a fair comparison, we fix the ensemble weight w = 0.5 in equation (8) for both SimCSE and Z-LaVI. We also include Z-LaVI with the dynamic ensemble weight given by equation (9). The performance gain of SimCSE and Z-LaVI on all six datasets is shown in Figure 3. We observe that Z-LaVI consistently has a higher performance gain than SimCSE across all datasets, demonstrating that the visual information provided by Z-LaVI complements language models better and hence boosts performance more. Additionally, Z-LaVI with dynamic weights performs better than simply setting the weight to 0.5.
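For concreteness, a sketch of this computation over per-model accuracies; the dict layout is illustrative:

```python
# A sketch of the average performance gain (PG); ens[m] is the accuracy of
# the ensemble of M1 with language model m, orig[m] is m's accuracy alone.
def avg_performance_gain(ens: dict, orig: dict) -> float:
    gains = [(ens[m] - orig[m]) / orig[m] for m in orig]
    return sum(gains) / len(gains)
```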

Analysis
Vision and language models behave differently. We examine the overlap of correctly predicted examples between two models, where $S_M$ denotes the set of correctly predicted examples of model M. Figure 4 shows the overlap of models' predictions on the Situation dataset. We observe that Z-LaVI (w/o LM) has a markedly smaller overlap with the other models, while the language models have large mutual overlaps. This difference explains the substantial performance gain from exploiting visual imagination.

RECALL vs. SYNTHESIS. We ablate the imagination methods and compare the performance of using only one of them. Table 5 reports the performance on each dataset with different imagination methods. For datasets with short inputs for imagination (e.g., the QA tasks), RECALL is better than SYNTHESIS. This is because the short inputs of science QA datasets normally correspond to objects that exist in the real world and are easy to find on the web, such as mollusca and porifera shown in Figure 7 (a). However, for queries with long sentences (WSD and Topic Classification), the text inputs are too specific to match any real photo, so SYNTHESIS is preferable. Figure 5 likewise indicates that the model prefers RECALL images for short inputs and tends to use SYNTHESIS images when the input contains more tokens. We also find that without images, Z-LaVI performs poorly on all tasks, reflecting the necessity of imagination.
Performance vs. Image Quantities. We combine RECALL and SYNTHESIS to imagine 200 image candidates. We wonder whether the number of imaginations impacts Z-LaVI's performance. Figure 6 reports Z-LaVI's performance on CoarseWSD-20 versus the number of images. We observe that Z-LaVI's F1 score increases with the number of images, though the improvement is marginal once the number exceeds 125.

Table 6: Zero-shot probing on the three relation types (COLOR, SHAPE, and MATERIAL) in the ViComTe (Zhang et al., 2022a) dataset. We report the average Spearman correlation (ρ) and top-1 accuracy (Acc@1).
Z-LaVI supplements visual commonsense knowledge. To further validate that Z-LaVI helps mitigate the reporting bias of language models, we conduct experiments on ViComTe (Zhang et al., 2022a), a commonsense knowledge dataset containing different types of properties for over 5,000 subjects, e.g., the subject "egg" has the property (object) "oval". We investigate three relation types (COLOR, SHAPE, and MATERIAL) and report results on the test set (see Table 7 for details). We select BERT-large and Oscar-large (Li et al., 2020) as baselines, whose results are directly obtained from Zhang et al. (2022a).¹⁵ For a fair comparison, we adopt the same set of seven prompt templates provided by Zhang et al. (2022a) and report the average performance over these prompts. Table 6 shows the performance of Z-LaVI with language models. Z-LaVI continues to consistently boost the performance of language models and outperforms the baselines by significant margins.

¹⁵ We also include a random baseline that assigns each class a score between 0 and 1 by chance. We repeat the random runs 7 times and report the average performance.

Related Work

Vision-Language Pretraining Models. To connect vision and language semantics, a line of work on multimodal masked language models (Li et al., 2019; Tan and Bansal, 2019; Lu et al., 2019; Su et al., 2020) explores vision-language pretraining and achieves SOTA fine-tuning performance on multimodal benchmarks. Another line of work injects visual knowledge into PLMs. To retain knowledge in both vision and language pretrained models, Flamingo (Alayrac et al., 2022) freezes both pretrained models and brings in additional model components to perform visually-conditioned autoregressive text generation. Tan and Bansal (2020) retrieve related images as vokens (visualized tokens) and then process large language corpora (e.g., Wikipedia) into voken-prediction tasks. FLAVA (Singh et al., 2022) is an alignment model that pretrains on both unimodal and multimodal data while optimizing cross-modal "alignment" objectives and multimodal fusion objectives. Unified-IO (Lu et al., 2022a) is a general-purpose model that can perform a wide range of vision, language, and multimodal tasks by unifying inputs and outputs as sequences.

Conclusion
In this paper, we propose a novel approach, Z-LaVI, to alleviate the reporting bias problem of pretrained language models and enhance their zero-shot inference ability. We develop two complementary visual imagination mechanisms: RECALL, which retrieves existing objects or scenes, and SYNTHESIS, which generates nonexistent ones. Experiments on a wide range of language tasks show that our approach can significantly outperform existing zero-shot language models, pointing towards a promising direction for solving unseen language tasks with visual imagination.

Limitations
Our experiments apply DALL•E-mini to synthesize images, but the quality and resolution of the generated images are still low, which may limit Z-LaVI's performance. However, the recent breakthroughs of DALL•E 2 (Ramesh et al., 2022), Imagen (Saharia et al., 2022), and the open-sourced Stable Diffusion (Rombach et al., 2022) give us hope of obtaining more realistic images and thus further unleashing the potential of Z-LaVI.
The negative results on ARC-C reveal the lack of complex reasoning ability in the current zero-shot vision-text model. At the same time, the success of Flamingo (Alayrac et al., 2022) on few-shot multimodal tasks suggests the possibility of applying the Z-LaVI framework with such powerful visual language models to solve broader language tasks. We foresee a bright future for our method once these resources become publicly available. In this paper, we focus on the zero-shot setting, where it is difficult to design a more effective approach to ensemble language and vision without training data. However, when few-shot examples are available, it is possible to learn a mechanism that automatically calibrates the weights of imagination depending on the input examples.
In addition, the image generation model is trained on unfiltered web data, which may leak personal information such as human faces. The generation model may also be biased toward stereotypes against minority groups. Furthermore, compared to language models, our approach requires extra resources, such as an image search engine and a pretrained text-to-image generation model, which increase the implementation cost. Finally, we evaluated our method on English datasets only; we plan to incorporate other languages in the future with the help of multilingual multi-modal models (Huang et al., 2021).
Figure 2: The overview of the proposed Z-LaVI system. Z-LaVI aims to solve language tasks with two streams of inputs: one stream of label options and another stream of instances to be labeled. Z-LaVI converts one of the streams (either the input labels or the input instance) into images through visual imagination (RECALL and SYNTHESIS) to enable the vision-text model to solve language tasks. We ensemble the language and vision-text models to make the final prediction.
Figure 7 (b) shows an example that requires multi-hop reasoning, where Z-LaVI fails to answer correctly.

Figure 3: The average performance gain on each dataset. Z-LaVI (parameter) stands for using the models' numbers of parameters (Equation 9) to adjust the weights.

Figure 4: The overlap of models' correct predictions on the Situation dataset.

Figure 5: The percentage of RECALL and SYNTHESIS images within the top-10 images of each dataset. The average token length for each dataset is given on the x-axis.

Figure 6: The performance on CoarseWSD-20 versus the number of image candidates provided by imagination.

Figure 7: Qualitative examples from (a) SciQ and (b) ARC-C. Z-LaVI successfully answers questions that can be solved by correlation but fails on the question that requires multi-hop reasoning.
Figure 8: Qualitative examples from AG News (a, b) and Situation (c, d).

Table 1: Dataset statistics for the three tasks.

Table 2: Zero-shot performance on Science QA tasks. Z-LaVI represents the performance with our visual imagination. Z-LaVI (w/o LM) uses only the vision-text prediction. The best number for each metric is bolded, and numbers are underlined if the original performance is improved by Z-LaVI. Models marked with * use labeled data for pretraining; models marked with † indicate results taken from previous work.

Table 4: Zero-shot performance on Topic Classification.

Table 5: The performance of Z-LaVI (w/o LM) with different imagination methods. #TOK is the average number of tokens of text inputs in each dataset. ✗ means no image is provided to the model and we only use the text encoder of CLIP. RECALL and SYNTHESIS represent using image search and image generation, respectively. BOTH means combining the two methods.