UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

Vision-language tasks, such as VQA, SNLI-VE, and VCR are challenging because they require the model's reasoning ability to understand the semantics of the visual world and natural language. Supervised methods for vision-language tasks have been well studied. However, solving these tasks in a zero-shot setting is less explored. Since Contrastive Language-Image Pre-training (CLIP) has shown remarkable zero-shot performance on image-text matching, previous works utilized its strong zero-shot ability by converting vision-language tasks into an image-text matching problem, and they mainly consider global-level matching (e.g., the whole image or sentence). However, we find visual and textual fine-grained information, e.g., keywords in the sentence and objects in the image, can be fairly informative for semantics understanding. Inspired by this, we propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning, covering multiple tasks such as VQA, SNLI-VE, and VCR. Our experiments show that our framework outperforms former zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR. Furthermore, our ablation studies confirm the effectiveness and generalizability of our proposed method. Code will be available at https://github.com/ThreeSR/UniFine


Introduction
VQA (Antol et al., 2015), SNLI-VE (Xie et al., 2019), and VCR (Zellers et al., 2019) are vision-language tasks, which utilize the text and corresponding image to test a system's cross-modal reasoning ability. These tasks are challenging because they require models to obtain a joint understanding of the visual and textual modalities. Nevertheless, they are also meaningful, since this capability plays an essential role in daily human-robot interaction, e.g., asking a robot how many people are in an image. Despite the difficulty, a line of work (Tan and Bansal, 2019; Li et al., 2019; Lu et al., 2019; Chen et al., 2019; Su et al., 2019; Li et al., 2020) has been dedicated to solving these vision-language tasks in a supervised setting and has made impressive progress. However, these methods all suffer from a significant problem: they are costly, as they require expert knowledge to collect well-annotated image-text data. On the other hand, zero-shot methods for vision-language tasks can bypass this problem without costly annotations. Unfortunately, relatively few works have explored this direction.
Recently, CLIP (Radford et al., 2021) has been proposed to acquire visual concepts using natural language supervision. It jointly trains an image encoder and a text encoder on 400M noisy image-text pairs collected from the Internet by aligning images and texts through a contrastive loss.
Previous works (Song et al., 2022; Subramanian et al., 2022; Shen et al., 2021; Wang et al., 2022b) demonstrated that CLIP can achieve strong zero-shot performance on vision-language tasks by converting the original tasks into an image-text matching format. However, they mainly consider matching on an instance or global level, i.e., the whole image or sentence, ignoring the significance of fine-grained elements, e.g., keywords in the sentence and objects in the image. Meanwhile, we find these fine-grained elements are important for specific downstream tasks, especially in zero-shot learning.
For instance, in Fig. 1, CLIP makes three incorrect predictions in three zero-shot vision-language tasks. For VQA, the model infers the wrong object "pancake" for the verb "eating", as it does not capture the details in the image (pizza on the table) or the caption (pizza is mentioned). We posit that if we can navigate the model to focus on these detailed pieces of textual and visual information, it would have a better chance of selecting the correct answer label. This conjecture also appears to generalize across multiple zero-shot downstream tasks, as shown by the three examples from different vision-language tasks, i.e., VCR, VQA, and SNLI-VE, in Fig. 1. Yet we also recognize that potential challenges exist, as these tasks differ in many respects, including the distribution of image categories or scenes, their semantic focus, the format of the text premise (declarative statement vs. question), and the task format (image-text matching vs. classification).
To overcome these challenges, we first identify two fundamental steps required to utilize fine-grained information across different vision-language tasks: 1) extraction of the fine-grained information from context, e.g., extracting the word "pizza" from the caption in VQA as in Fig. 1; 2) semantic matching between the extracted fine-grained information and the answer choices or hypothesis. Based on these, we propose a unified approach built on these two common steps, which helps the model generalize across different vision-language tasks. The extractor has two branches: 1) a vision branch and 2) a textual branch. In the vision branch, we employ Faster-RCNN (Ren et al., 2015) to extract object-level information. We select relevant object regions guided by the question in VQA and VCR or the hypothesis in SNLI-VE. After that, we concatenate the whole image with its selected regions and input them into the image encoder of CLIP. For textual information extraction, we exploit rich information from the image caption, generated by the recently developed captioning model OFA (Wang et al., 2022a), and from the question in VQA and VCR or the hypothesis in SNLI-VE, to boost zero-shot performance.
Note that although we employ the image caption and question at the sentence level rather than the word level, we compute the cosine similarity between them and the answer texts, which means that if keywords in the answer texts can be matched in the caption or question, we obtain high scores in zero-shot prediction.
Therefore, it is still a process of fine-grained information extraction. By using fine-grained information, our model outperforms previous methods on zero-shot VQA, and we are the first to benchmark zero-shot VCR and SNLI-VE. The experiments confirm the effectiveness of our proposed method.
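As a toy illustration of why sentence-level cosine similarity still rewards keyword matches, the sketch below uses a bag-of-words embedder as a stand-in for RoBERTa; all texts and values are hypothetical, not the paper's actual features:

```python
import numpy as np

def bow_vector(text, vocab):
    # Simple bag-of-words vector over a shared vocabulary.
    tokens = text.lower().split()
    return np.array([tokens.count(w) for w in vocab], dtype=float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

caption = "a pizza on the table"
answers = ["pizza", "pancake"]
vocab = sorted(set(" ".join([caption] + answers).split()))

# An answer sharing a keyword with the caption gets a higher score.
scores = {ans: cosine(bow_vector(caption, vocab), bow_vector(ans, vocab))
          for ans in answers}
```

Here "pizza" overlaps with the caption and scores above "pancake", mirroring how keyword matches surface through sentence-level cosine similarity.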
Our contributions can be summarized as follows:
• To the best of our knowledge, we are the first to propose a unified approach based on fine-grained information extraction for zero-shot learning of different vision-language tasks.
• Our approach outperforms previous CLIP-based methods for zero-shot learning of VQA, and we are the first to study CLIP's zero-shot ability on SNLI-VE and VCR.
• The experiments and ablation studies confirm the generalizability of our proposed method and the significance of visual and textual fine-grained information for zero-shot learning of vision-language tasks.

Related Work
Vision-language understanding tasks. Unlike unimodal tasks, vision-language understanding tasks need a joint understanding of vision and language, which requires deeper reasoning ability from the system. In VQA (Goyal et al., 2017), given a question, the model needs to understand the details of the corresponding image to answer correctly. The real images in VQA come from MS COCO (Lin et al., 2014), and each of them is paired with a caption in COCO Captions (Chen et al., 2015). The semantic focus of VCR (Zellers et al., 2019) differs from VQA, since it concentrates more on commonsense questions. The model first needs to answer recognition questions (as in VQA), and then it is also required to correctly answer cognition questions, which ask for the rationale behind the choice in the first question.
Vision-language pre-trained models. Early vision-language pre-trained models (Tan and Bansal, 2019; Lu et al., 2019; Li et al., 2019; Chen et al., 2019; Su et al., 2019; Li et al., 2020) utilize a cross-modal transformer (Vaswani et al., 2017) pre-trained on well-annotated image-text pairs. Different from these models, contrastive learning frameworks (Radford et al., 2021; Pham et al., 2021; Jia et al., 2021) are trained on noisy image-text pairs crawled from the Internet through a contrastive loss, which employs the dot product between the visual and textual modalities. Due to the large-scale training data, these models acquire rich prior knowledge and show strong zero-shot ability on vision benchmarks like ImageNet (Deng et al., 2009).
Vision-language zero-shot learning. There is a line of work utilizing CLIP for zero-shot learning on vision-language tasks. ReCLIP (Subramanian et al., 2022) uses CLIP to present a zero-shot method for referring expression comprehension (ReC), which outperforms prior zero-shot ReC approaches. CLIP-ViL (Shen et al., 2021) exploits CLIP for zero-shot VQA by simply concatenating each question and answer pair into the prompt "question: [question text] answer: [answer text]". Then, they feed the text and image into the text encoder and the image encoder of CLIP, which produces near-chance-level performance. The most relevant work to ours is TAP-C (Song et al., 2022), which manually designs the prompt and leverages T5 (Raffel et al., 2020), a large pre-trained text-to-text Transformer, to convert the question-answering problem into an image-text matching task. It then employs CLIP's remarkable zero-shot image-text matching ability on VQA, and its results surpass CLIP-ViL by a large margin. However, these works handle different tasks on an instance level rather than fully utilizing the visual and textual fine-grained information (i.e., keywords in the sentence and objects in the image) as ours does. Moreover, we can tackle a diverse set of tasks, whereas they concentrate on one specific task.

Method
In this section, we introduce our method for visual and textual fine-grained information extraction to improve zero-shot learning of vision-language tasks including VQA, VCR, and SNLI-VE.

Baseline Method
In the baseline method, shown in Fig. 2, we use CLIP for zero-shot learning of vision-language tasks. CLIP consists of a visual encoder V (e.g., ResNet (He et al., 2016) or ViT (Dosovitskiy et al., 2020)) and a text encoder T (e.g., a transformer (Vaswani et al., 2017)), where the image and text are processed independently. The encoders are followed by a dot product between the visual and textual features, i.e., V(image) · T(text), which serves as the alignment score. We input the image from VQA, VCR, or SNLI-VE into the CLIP visual encoder. Because the task formats differ, the answer choices from VQA and VCR and the hypothesis from SNLI-VE are input into the CLIP text encoder. After encoding, we obtain the alignment score between the image and text. In VQA and VCR, we select the answer with the highest score. In SNLI-VE, a clustering process follows the dot product, as demonstrated in Algo. 1, and we select the answer with the lowest score.
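The baseline scoring for VQA and VCR can be sketched in a few lines; the feature vectors below are hypothetical stand-ins for the outputs of CLIP's encoders V and T:

```python
import numpy as np

# Toy stand-ins for CLIP features (hypothetical values).
# In the real pipeline these come from V(image) and T(answer_text).
image_feat = np.array([0.6, 0.8])
answer_feats = {
    "pizza":   np.array([0.707, 0.707]),
    "pancake": np.array([1.0, 0.0]),
    "salad":   np.array([0.0, 1.0]),
}

def alignment_score(img, txt):
    # CLIP's alignment score is the dot product of the two features.
    return float(img @ txt)

# VQA/VCR: pick the answer with the highest image-text alignment score.
scores = {a: alignment_score(image_feat, f) for a, f in answer_feats.items()}
prediction = max(scores, key=scores.get)
```

For SNLI-VE the same alignment scores are instead fed to the clustering step of Algo. 1 rather than an argmax over answers.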

Visual fine-grained information extraction
In visual fine-grained information extraction, we aim to find the image regions related to the question in VQA and VCR or the hypothesis in SNLI-VE, since these regions can provide local visual clues that complement the global image. The objects and attributes are detected by Faster-RCNN (Ren et al., 2015), pre-trained on Visual Genome (Krishna et al., 2017) and provided by Anderson et al. (2018). We select the top N relevant image regions (N is a hyperparameter, analyzed in Sec. 4.3) by the image region score, i.e., the cosine similarity between the textual features of the question or hypothesis and the object class and attribute (e.g., "yellow flowers") encoded by RoBERTa (Liu et al., 2019):

$$\mathrm{Score}(o) = \cos\big(R(\mathrm{Query}),\ R(\mathrm{Attr}(o)\ \mathrm{Class}(o))\big),\quad o \in O,$$

where R is RoBERTa, cos(·,·) is cosine similarity, O is the set of objects detected by Faster-RCNN, Attr(·) and Class(·) are the attribute and class of an object respectively, and Query is the question in VQA and VCR or the hypothesis in SNLI-VE. After selection, the global image and selected image regions are fed into the CLIP visual encoder to obtain the encoded feature of each.
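The top-N region selection can be sketched as follows, with a toy bag-of-words embedder standing in for RoBERTa and hypothetical detector outputs; only the ranking logic mirrors the method:

```python
import numpy as np

def embed(text, vocab):
    # Toy bag-of-words embedder standing in for RoBERTa R(.).
    tokens = text.lower().split()
    return np.array([tokens.count(w) for w in vocab], dtype=float)

def cosine(a, b):
    # Small epsilon guards against zero-norm vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def select_regions(query, detections, n):
    # detections: list of (region_id, attribute, class) from the detector.
    texts = [f"{attr} {cls}" for _, attr, cls in detections]
    vocab = sorted(set(" ".join(texts + [query]).split()))
    q = embed(query, vocab)
    # Rank regions by similarity between the query and "attribute class".
    scored = sorted(
        detections,
        key=lambda d: cosine(q, embed(f"{d[1]} {d[2]}", vocab)),
        reverse=True,
    )
    return [rid for rid, _, _ in scored[:n]]

detections = [(0, "yellow", "flowers"), (1, "wooden", "table"), (2, "red", "car")]
top = select_regions("what color are the flowers", detections, n=2)
```

The region mentioning "flowers" ranks first for a flower-related query, which is the behavior the image region score is designed to produce.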

Textual fine-grained information extraction
Next, we present how textual fine-grained information is extracted and incorporated into our framework. Specifically, two types of information are studied: the image caption and the question. The question, as a prior, can narrow down the range of answer candidates and eliminate irrelevant answers. The image caption transforms the information inside the image into text so that it can be compared with answers in the same domain. Image captions are generated from the image, but their format is language; thus, we arguably regard image captions as textual fine-grained information. To overcome the challenge of the differing formats of vision-language tasks, we introduce a relatively unified way to extract and utilize textual fine-grained information in the zero-shot scenario.
Visual Question Answering: VQA Following previous work, we experiment on the validation set of VQAv2 (Goyal et al., 2017). Typically, VQA is regarded as a classification problem over the 3,129 most frequent answers. VQAv2 has 65 question types (e.g., the "does this" type) and 3 answer types: Yes/No, Number, and Other.
Although each image in VQA is paired with a ground-truth caption from MS COCO, we still choose to use OFA, a SOTA image captioning model, to generate the caption from the image, because not every dataset is annotated with ground-truth captions and we would like to make our method generalizable.
As Shen et al. (2021) show, directly inputting the concatenation of the question and answer into CLIP leads to near-chance-level performance. In addition, there are more than 3,000 answer candidates in VQAv2, which would greatly slow down zero-shot VQA inference if all answers were input into CLIP. To bypass this, we utilize an answer-filtering method, inspired by Song et al. (2022), to downsize the number of answer choices.
Following Song et al. (2022), we first convert the question-answering format into declarative templates containing the <extra_id_0> token via T5 low-shot demonstrations. Then, the templates with the <extra_id_0> token are input into T5, and we obtain the plausibility of each answer candidate from T5's output probability. Next, we select the top K answers. More details can be found in Sec. A.2.
In this way, we can downsize the number of answers in VQA. The three answer types in VQA are processed differently during answer filtering. For the Yes/No type, we treat the problem as binary classification. For the Number type, since its answers are highly related to the numerical answers among the 3,129 most frequent answers, we heuristically select the 285 numerical answers from the 3,129 before answer filtering. For the Other type, we preserve the original answer candidates without filtering.
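A minimal sketch of the per-type answer-filtering logic, assuming hypothetical plausibility scores in place of real T5 decoder probabilities (the `filter_answers` helper and its inputs are illustrative, not the paper's code):

```python
def filter_answers(candidate_scores, answer_type, k):
    # candidate_scores: answer -> plausibility (hypothetical values here;
    # in the paper these come from the T5 decoder).
    # Yes/No: treated as binary classification, no filtering needed.
    if answer_type == "yes/no":
        return ["yes", "no"]
    # Number: restrict to numerical candidates first
    # (the paper keeps 285 numerical answers out of 3,129).
    if answer_type == "number":
        candidate_scores = {a: s for a, s in candidate_scores.items()
                            if a.replace(" ", "").isdigit()}
    # Keep the top-K most plausible answers for CLIP to rank.
    ranked = sorted(candidate_scores, key=candidate_scores.get, reverse=True)
    return ranked[:k]

scores = {"2": 0.4, "3": 0.3, "red": 0.2, "10": 0.05, "dog": 0.05}
top = filter_answers(scores, "number", k=2)
```

With the hypothetical scores above, the Number branch drops the non-numeric candidates before taking the top K.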
After obtaining the top K filtered answers, on one hand, they are sent to the CLIP text encoder, and the dot product with the image features is calculated, denoted as the CLIP alignment score S_CLIP. On the other hand, we calculate the question prior score S_Question, i.e., the cosine similarity between the RoBERTa textual features of the question and the answers, and the caption prior score S_Caption, i.e., the cosine similarity between the RoBERTa textual features of the OFA-generated image caption and the answers. The whole process can be summarized as the following equations:

$$S_{\mathrm{CLIP}}(A) = \sum_{I \in \{I_g\} \cup Reg} V(I) \cdot T(A),$$
$$S_{\mathrm{Question}}(A) = \cos\big(R(Q),\ R(A)\big),$$
$$S_{\mathrm{Caption}}(A) = \cos\big(R(O(I_g)),\ R(A)\big),$$

where V and T are the image and text encoders of CLIP, R is RoBERTa, O is OFA, and cos(·,·) is cosine similarity. I denotes the images, including one global image I_g and the N selected image regions {I_l ∈ Reg}. Q and A correspond to the question and its top K filtered answers. O(I_g) is the image caption generated by OFA.
In the end, all scores are ensembled, and we select the answer with the highest score as the zero-shot prediction:

$$\hat{A} = \arg\max_{A}\ \big(k_1 S_{\mathrm{CLIP}}(A) + k_2 S_{\mathrm{Question}}(A) + k_3 S_{\mathrm{Caption}}(A)\big),$$

where k_1, k_2, and k_3 are hyperparameters.
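The ensembling step can be sketched as below; the per-answer scores and the weights are hypothetical values, standing in for the real S_CLIP, S_Question, and S_Caption:

```python
def ensemble_predict(answers, s_clip, s_question, s_caption, k=(1.0, 0.5, 0.5)):
    # Weighted sum of the three scores; the answer with the highest total wins.
    k1, k2, k3 = k
    total = {a: k1 * s_clip[a] + k2 * s_question[a] + k3 * s_caption[a]
             for a in answers}
    return max(total, key=total.get)

answers = ["pizza", "pancake"]
pred = ensemble_predict(
    answers,
    s_clip={"pizza": 0.60, "pancake": 0.55},       # hypothetical CLIP scores
    s_question={"pizza": 0.20, "pancake": 0.30},   # hypothetical question prior
    s_caption={"pizza": 0.80, "pancake": 0.10},    # hypothetical caption prior
)
```

Here the caption prior tips the decision toward "pizza" even though the question prior slightly favors "pancake", illustrating how the ensemble lets fine-grained textual evidence override a weak instance-level signal.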
Visual Commonsense Reasoning: VCR VCR is similar to VQA, since both are in question-answering format. However, there are only four answer choices per question, so we do not need answer filtering.
Visual Entailment: SNLI-VE Since there are no answer candidates, we cannot directly compare the CLIP alignment scores of answers to select the best one, as in VQA and VCR. To tackle this, we compute the CLIP alignment score between the image and hypothesis of each sample in the whole evaluation set and cluster those scores into three clusters with three centroids. We rank the centroids from high to low and sequentially treat them as the entailment centroid C^e_CLIP, the neutral centroid C^n_CLIP, and the contradiction centroid C^c_CLIP. The details of the clustering are given in Algo. 1. Note that, for the cluster centroids to be meaningful, an assumption is required: the three relationships are uniformly distributed in the evaluation dataset. This assumption holds in SNLI-VE but is not guaranteed in other, less-calibrated datasets. We can measure how close the S_CLIP of each sample is to each centroid:

$$D^{r}_{\mathrm{CLIP}} = \big|S_{\mathrm{CLIP}} - C^{r}_{\mathrm{CLIP}}\big|,\quad r \in \{e, n, c\},$$

where C^r_CLIP is the corresponding centroid. Note that, due to the lack of answer candidates, we cannot compute the question prior score S_Question. In the end, we ensemble the two distances and predict the relationship by picking the closest centroid:

$$\hat{r} = \arg\min_{r \in \{e, n, c\}}\ \big(D^{r}_{\mathrm{CLIP}} + D^{r}_{\mathrm{Caption}}\big).$$

Experiments
In this section, we first present the benchmark comparison to show our strong performance. Then, we conduct extensive ablation studies to confirm the effectiveness of fine-grained information.

Experimental setup
Datasets. We analyze three vision-language tasks in our paper. For each task, we use the validation set; for VQA, this is the validation set of VQAv2 (Goyal et al., 2017), which we use for comparison with prior work. Since we are the first to evaluate CLIP's zero-shot ability on SNLI-VE and VCR, there is no prior work to compare against, so we simply use CLIP ViT-B/16 for VCR and SNLI-VE. We believe the scale of the model has a large impact on the results, so we also use CLIP ViT-L/14@336px in VQA, VCR, and SNLI-VE to see how much improvement a larger model brings. In addition to CLIP, we use T5-large for task format conversion, OFA-base for image captioning, RoBERTa-large for the cosine similarity computations, and Faster-RCNN for object detection.

Benchmark comparison
VQA. Results of zero-shot VQA are reported in Tab. 1. For a fair comparison, we compare our method with two CLIP-based methods. We choose TAP-C (Song et al., 2022) as our baseline. Since the authors did not release their code, we reimplemented it from scratch and obtained a lower score than reported for TAP-C; differences in the specific prompt design and answer-filtering process may explain the gap. Although our reimplemented results are lower than the reported ones, we surpass TAP-C after extracting and exploiting visual and textual fine-grained information. Compared to our reimplemented baseline, our method improves performance on all answer types, and a larger CLIP model achieves better performance still. Our best result surpasses the reimplemented and reported TAP-C results by 2.83% and 1.63% respectively. Our method thus outperforms previous CLIP-based methods for zero-shot VQA.

SNLI-VE.
We report the results of SNLI-VE in Tab. 2. With the baseline method, we obtain an overall accuracy of 47.37%, which is 14.04% higher than random performance. This result shows that our baseline method is strong, and it confirms CLIP's zero-shot ability on SNLI-VE. By extracting fine-grained information and scaling up the model, we increase accuracy by up to 2.79%. Among the answer types, Neutral improves the most (+10.91%), while Entailment decreases by 3.24%. Note that the Neutral type is harder than Entailment and Contradiction, since it is less clear-cut and demands deeper reasoning from the model; the improvement on Neutral therefore underscores the significance of fine-grained information. The decrease on Entailment is likely due to a deficiency of our clustering method, which should be improved in the future. Since there was no prior CLIP-based zero-shot method for SNLI-VE, we compare against the supervised method EVE-Image from the SNLI-VE paper (Xie et al., 2019). Although our overall performance is not yet comparable to the supervised method, our result on the Contradiction type approaches EVE-Image.
VCR. The results of VCR are reported in Tab. 2. We carry out experiments on two VCR subtasks, namely Q2A and QA2R. Compared to random performance on Q2A and QA2R, our baseline method improves by 28.24% and 21.51% respectively, which confirms CLIP's strong zero-shot ability for VCR. By extracting fine-grained information and using a larger model, we improve over the baseline by up to 5.24% and 5.37%, which demonstrates the effectiveness of our proposed method. There is no prior CLIP-based method for zero-shot VCR, so we compare against the supervised model R2C proposed in the VCR paper (Zellers et al., 2019). Although we cannot surpass the supervised model, our Q2A result approaches R2C, and our results are competitive.

Ablation studies
In this section, we analyze every important component of our proposed method. In Tab. 3, we can see that each type of fine-grained (FG) information helps zero-shot learning, and combining all FG information brings further improvement.
Textual FG Information - Question: Adding the question prior helps VCR the most. We believe the first reason is that the questions and answers in VCR are longer and more complex than in the other two datasets, so they provide more useful and richer information for zero-shot inference. Second, the correct answer is likely to have more overlap with the question. We also observe that the question does not help much on the VQA Yes/No answer type, since this is a binary classification problem and many questions are of the "Is this A or B?" form, which provides little additional information for zero-shot prediction.
Visual FG Information - Image Region: Image regions largely improve performance on the Other answer type in VQA, because questions of this type tend to query details of the image, and image regions provide finer details for zero-shot inference. At the same time, image regions do not help SNLI-VE much; we believe SNLI-VE concentrates more on the global image, so local regions contribute less. We also notice that using image captions may hurt some categories of SNLI-VE, which we attribute to the quality of the generated captions.
Generation vs. Ground Truth: Since not every dataset is well human-annotated, we test these two settings to assess the generalizability of our proposed method. In the generation setting, we generate image captions with OFA and detect objects with Faster-RCNN. In the ground-truth setting, as mentioned above, images in VQA and SNLI-VE are paired with ground-truth captions. For VCR, images are not paired with human-annotated captions; however, 68% of the images in the VCR validation set also appear in VisualCOMET (Park et al., 2020), which provides ground-truth captions, so we directly use the VisualCOMET captions for VCR. Although images in VCR are not paired with captions, they are annotated with ground-truth bounding boxes, so we run a ground-truth image-region experiment for VCR; VQA and SNLI-VE are not annotated with ground-truth bounding boxes. As Tab. 3 shows, our method works well even without rich annotations: we achieve similar performance in the generation and ground-truth settings, which confirms the generalizability of our proposed method.
Model Scale: We believe that the model scale affects the final result, since larger models can better process visual and textual information. In our experiments, we mainly focus on two variants of CLIP, namely CLIP ViT-B/16 and CLIP ViT-L/14@336px. We also run experiments with CLIP RN50x16 on the VQA task, which can be found in Tab. 6. We observe that larger models improve performance, and all of our best results are achieved with CLIP ViT-L/14@336px.

Number of Image Regions:
In this subsection, we examine how the number of selected image regions N affects zero-shot performance on the different vision-language tasks. For convenience, we run experiments on the Yes/No answer type of VQA, on SNLI-VE, and on the Q2A task of VCR. Full results are reported in Tab. 8; for better visualization, we normalize the results. In Fig. 4, we observe that as the number of image regions increases, the performance on all three tasks first increases and then decreases. Selecting 5 image regions is optimal for VQA and SNLI-VE; for VCR, 12 image regions are optimal. Visual fine-grained information helps CLIP and plays an important role in zero-shot prediction since it provides fine details of the image, but beyond a certain point additional image regions hurt performance because they introduce irrelevant visual information.
In our experiments, we select 5 regions for VQA and SNLI-VE, and 12 regions for VCR.

Conclusion
In this work, we propose a unified and fine-grained approach for vision-language tasks including VQA, SNLI-VE, and VCR. We outperform previous CLIP-based methods for zero-shot VQA. In addition, we are the first to empirically study CLIP's zero-shot ability on SNLI-VE and VCR, achieving strong zero-shot performance. Beyond the benchmark comparison, we conduct extensive ablation studies confirming the significance of visual and textual fine-grained information and the generalizability of our proposed method.

Limitations
Although our proposed method is effective on three vision-language tasks, it still has some limitations. First, we use T5 to convert the question-answering format into a declarative sentence in VQA; this works well in most cases but still faces out-of-coverage problems, which affect the subsequent zero-shot prediction of CLIP. More rules should be designed for these special cases to improve the conversion. Second, our clustering algorithm for SNLI-VE achieves strong zero-shot performance, but the cluster centroids are close to each other and the algorithm is sensitive to them; its robustness should be improved. Moreover, we leverage Faster-RCNN for visual fine-grained information extraction, so the detectable object attributes and classes are constrained to Faster-RCNN's relatively limited object set, which may hinder further improvement from visual fine-grained information; Faster-RCNN could be replaced with a better vision module. Finally, since we only use CLIP in this paper, the zero-shot ability of other contrastive pre-training models can be explored in future work.

Ethics Statement
Many large-scale pre-trained models are used in our paper, such as OFA, T5, RoBERTa, and CLIP. Following previous work, we use the val2014 split of VQAv2. In zero-shot SNLI-VE and VCR, we use the validation set.
A.2 Answer filtering for VQA
Answer filtering. As in TAP-C (Song et al., 2022), we first manually design the demonstrations and employ T5 to convert the question-answering format of VQA into a declarative template with the <extra_id_0> token. Then, we input the concatenation of the demonstrations and the converted declarative statement, with the <extra_id_0> token, into the T5 encoder. Next, the encoded features from the T5 encoder and the answer candidates are fed to the T5 decoder, which calculates the probability of each answer candidate. We select the top K answers to replace the <extra_id_0> token in the template, generating K prompts that are fed into the CLIP text encoder.
Setting of hyperparameter K. Since we employ answer filtering to select the top K answers, K is an important hyperparameter. In Tab. 5, we show how the zero-shot performance on the VQA Number and Other types varies with the number of selected top K answers, with six and seven experiments on these two types respectively. We observe that as K increases, performance first increases and then decreases. When K is small, many correct answers are removed outright by T5, making it impossible for CLIP to choose the right answer. Conversely, when K is very large, the many remaining answers are likely to disturb CLIP's zero-shot prediction. In our experiments, we select the top 10 answers for the VQA Other type and the top 4 answers for the VQA Number type.

Algorithm 1 Pseudocode of clustering algorithm
Input: V: CLIP image encoder, T: CLIP text encoder, I: all images in SNLI-VE val split, H: all hypotheses in SNLI-VE val split, N : the number of samples in SNLI-VE val split; Output: centroid.
1: dictionary centroid initialized to 0
2: array scores initialized to 0
3: // use CLIP to calculate the dot product for each sample
4: for i = 0; i < N; i++ do
5:     scores[i] = V(I_i) · T(H_i)
6: end for
7: sort scores from high to low and split them into three equal-sized groups
8: centroid[e], centroid[n], centroid[c] ← mean of each group (high, middle, low)
9: return centroid
In fact, we can cache the centroids in advance. To achieve better performance, we tune the centroids, which are reported in Tab. 7. The effectiveness of Algo. 1 relies on a relatively even data distribution. K-Means could also be used here, but it likewise requires a relatively even data distribution. The validation split of SNLI-VE has 17,858 samples, which is not divisible by 3; however, we can assume there are 5,952, 5,953, and 5,953 samples in the entailment, neutral, and contradiction categories respectively.
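Under the uniform-distribution assumption stated above, one way to implement the centroid construction and nearest-centroid prediction is sketched below; this is a simplified reading of Algo. 1, not the authors' released code, and the score values are hypothetical:

```python
import numpy as np

def cluster_centroids(scores):
    # Sort alignment scores and split them into three equal-sized groups,
    # assuming the three labels are roughly uniformly distributed.
    s = np.sort(np.asarray(scores, dtype=float))[::-1]  # high to low
    groups = np.array_split(s, 3)
    # Highest group -> entailment, middle -> neutral, lowest -> contradiction.
    return {label: float(g.mean())
            for label, g in zip(["entailment", "neutral", "contradiction"],
                                groups)}

def predict(score, centroids):
    # Assign the label of the nearest centroid.
    return min(centroids, key=lambda r: abs(score - centroids[r]))

scores = [0.9, 0.85, 0.5, 0.45, 0.1, 0.05]  # hypothetical alignment scores
c = cluster_centroids(scores)
label = predict(0.88, c)
```

A sample whose alignment score sits near the top of the distribution is assigned to the entailment centroid, matching the high-to-low ranking of centroids described in the text.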

A.5 How # image regions affect performance
The full results are reported in Tab. 8; they are the values before normalization in Fig. 4. Through the table and figure, we can see how the number of selected image regions N affects zero-shot performance.
A.6 Zero-shot learning using only textual fine-grained information
We think it is interesting to investigate zero-shot performance when only textual fine-grained information is used. We exploit only the language model for zero-shot prediction in all three vision-language tasks. All results are shown in Tab. 9.
In VQA, we use T5-large (for answer filtering) and RoBERTa-large. In SNLI-VE and VCR, we only use RoBERTa-large. Visual information is not considered, and in this setting textual fine-grained information includes the image caption and the question. The results show that using only textual fine-grained information achieves fair performance. (Note: using only ground-truth textual fine-grained information in SNLI-VE surpasses the baseline performance, because the relation between the ground-truth caption and the hypothesis is well annotated in SNLI (Bowman et al., 2015).)

Figure 1 :
Figure 1: Examples from VQA, SNLI-VE, and VCR of how fine-grained information helps CLIP. Before the fine-grained information is extracted, CLIP gives the wrong answer, shown as the red box. With the assistance of visual and textual fine-grained information, CLIP makes the correct decision, as the green box shows. (For visualization, only three answer choices are kept in VQA and VCR, and unisex names (Riley and Jackie) are added in VCR.)

Figure 2 :
Figure 2: Baseline method of UniFine. (Note: For visualization, only three answer choices are kept in VQA and VCR. CLIP-VE denotes the CLIP Visual Encoder, and CLIP-TE denotes the CLIP Text Encoder.)

Figure 3 :
Figure 3: Overview of our proposed UniFine method. Visual fine-grained information is extracted under the guidance of the question (in VQA and VCR) or hypothesis (in SNLI-VE). In addition, textual fine-grained information is extracted by utilizing the question (in VQA and VCR) or hypothesis (in SNLI-VE) and the image caption. (Note: CLIP-VE denotes the CLIP Visual Encoder and CLIP-TE denotes the CLIP Text Encoder.)

Q2A and QA2R are two subtasks of VCR. Q2A is similar to VQA in that there is only one question per sample, so the Q2A pipeline is the same as for VQA except that answer filtering is omitted. QA2R aims to find the rationale for why the correct answer is chosen in the Q2A question. Since there is no question text in QA2R and the correct answer is provided, we directly use the correct answer as the question text; the other procedures in QA2R are the same as in Q2A.
Visual Entailment: SNLI-VE The task format of SNLI-VE is different from VQA and VCR. For each sample, only one image premise I and one hypothesis H are given, without answer candidates. It is a three-way classification problem, aiming to predict the relation between the image premise and the hypothesis text as one of three classes: Entailment, Contradiction, and Neutral.

Figure 4 :
Figure 4: Normalized performance versus the number of image regions.
The images in VCR are collected from movie clips. SNLI-VE originates from the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015), a text entailment (TE) task based on the Flickr30k (Young et al., 2014) image captions. It extends TE into the visual domain and has a different task format from VQA and VCR, because the question-answering format is replaced with a hypothesis. Given the image and hypothesis, the model needs to predict whether the image semantically entails the text. The images in SNLI-VE come from Flickr30k with annotated captions.
The centroid belongs to {C^e_CLIP, C^n_CLIP, C^c_CLIP}. Besides the CLIP alignment score comparison, we can obtain the caption prior score S_Caption(I, H) using the image caption generated by OFA. As above, we apply the clustering method of Algo. 1, only changing the CLIP score to the caption score, to get three centroids {C^e_Caption, C^n_Caption, C^c_Caption}, and we measure how close the S_Caption of each sample is to each centroid:

$$D^{r}_{\mathrm{Caption}} = \big|S_{\mathrm{Caption}} - C^{r}_{\mathrm{Caption}}\big|,\quad r \in \{e, n, c\}.$$

Table 3 :
In Tab. 3, we can observe that the image caption better assists the Number and Other answer types in VQA. For the Number type, we think the image caption may contain numerical information that aids zero-shot prediction. Since the Other type contains a large number of questions, it covers diverse question types, some of which focus on instance-level information; the image caption normally captures instance-level information, so it helps the VQA Other answer type.