Check It Again: Progressive Visual Question Answering via Visual Entailment

While sophisticated Visual Question Answering models have achieved remarkable success, they tend to answer questions only according to superficial correlations between question and answer. Several recent approaches have been developed to address this language priors problem. However, most of them predict the correct answer according to one best output without checking the authenticity of answers. Besides, they only explore the interaction between image and question, ignoring the semantics of candidate answers. In this paper, we propose a select-and-rerank (SAR) progressive framework based on Visual Entailment. Specifically, we first select the candidate answers relevant to the question or the image, then we rerank the candidate answers by a visual entailment task, which verifies whether the image semantically entails the synthetic statement of the question and each candidate answer. Experimental results show the effectiveness of our proposed framework, which establishes a new state-of-the-art accuracy on VQA-CP v2 with a 7.55% improvement.


Introduction
The Visual Question Answering (VQA) task is a multimodal problem that requires comprehensive understanding of both visual and textual information. Presented with an input image and a question, a VQA system tries to determine the correct answer in a large prediction space. Recent studies (Jabri et al., 2016; Agrawal et al., 2016; Zhang et al., 2016; Goyal et al., 2017) demonstrate that VQA systems suffer from superficial correlation bias (i.e. language priors) caused by accidental correlations between answers and questions. As a result, traditional VQA models always output the … However, by examining the characteristics of existing methods, we find that both general VQA models such as UpDn (Anderson et al., 2018) and LXMERT (Tan and Bansal, 2019) and models carefully designed for language priors, such as LMH (Clark et al., 2019) and SSL, share a non-negligible problem: they predict the answer from the single best output without checking its authenticity. Besides, these models do not make good use of the semantic information of answers, which could help alleviate language priors.
As presented in Figure 1(a), quite a few correct answers occur among the top N candidates rather than at the top one. Meanwhile, if the top N candidate answers are given, the image can further verify the visual presence/absence of concepts based on the combination of the question and the candidate answer. As shown in Figure 1(b), the question is about the color of the bat and the two candidate answers are "yellow" and "black". After checking the correctness of the candidate answers, the wrong answer "yellow", which contradicts the image, can be excluded and the correct answer "black", which is consistent with the image, is confirmed. Nevertheless, this visual verification, which utilizes answer semantics to alleviate language priors, has not been fully investigated.
In this paper, we propose a select-and-rerank (SAR) progressive framework based on Visual Entailment. The intuition behind the proposed framework comes from two observations. First, after excluding the answers unrelated to the question and image, the prediction space shrinks and we obtain a small number of candidate answers. Second, once a question and one of its candidate answers are bridged into a complete statement, the authenticity of this statement can be inferred from the content of the image. Therefore, after selecting several possible answers as candidates, we can utilize visual entailment, which operates on image-text pairs, to verify whether the image semantically entails the synthetic statement. Based on the entailment degree, we can further rerank the candidate answers and give the model another chance to find the right answer. To summarize, our contributions are as follows:
1. We propose a select-and-rerank progressive framework to tackle the language priors problem, and empirically investigate a range of design choices for each module of this framework. In addition, it is a generic framework, which can be easily combined with existing VQA models and further boost their abilities.
2. We highlight the verification process between text and image, and formulate the VQA task as a visual entailment problem. This process makes full use of the interactive information of image, question and candidate answers.
3. Experimental results demonstrate that our framework establishes a new state-of-the-art accuracy of 66.73% on VQA-CP v2, outperforming the existing methods by a large margin.

Related Work
Language-Priors Methods To address the language prior problem of VQA models, many approaches have been proposed, which can be roughly categorized into two lines: (1) Designing Specific Debiasing Models to Reduce Biases. Most works of this line are ensemble-based methods (Ramakrishnan et al., 2018; Grand and Belinkov, 2019; Cadene et al., 2019; Clark et al., 2019; Mahabadi and Henderson, 2019). Among these, LMH (Clark et al., 2019) reduces all biases between question-answer pairs by penalizing the samples that can be answered without utilizing image content. (2) Data Augmentation to Reduce Biases. The main idea of such works (Zhang et al., 2016; Goyal et al., 2017) is to carefully construct more balanced datasets to overcome priors. For example, the recent method SSL first automatically generates a set of balanced question-image pairs, then introduces an auxiliary self-supervised task to use the balanced data. CSS (Chen et al., 2020a) balances the data by adding more complementary samples, which are generated by masking objects in the image or some keywords in the question. Based on CSS, CL (Liang et al., 2020) forces the model to utilize the relationship between complementary samples and original samples. Unlike SSL and CSS, which do not use any extra manual annotations, MUTANT (Gokhale et al., 2020) locates critical objects in the image and critical words in the question by utilizing extra object-name labels, which directly helps the model to ground the textual concepts in the image. However, the above methods only explore the interaction between the image and the question, ignoring the semantics of candidate answers. In this paper, we propose a progressive VQA framework, SAR, which achieves better interaction among the question, the image and the answer.
Answer Re-ranking Although answer re-ranking is still in its infancy in the VQA task, it has been widely studied for QA tasks such as open-domain question answering, in which models need to answer questions based on a broad range of open-domain knowledge sources. Recent works (Wang et al., 2018b,a; Kratzwald et al., 2019) address this task in a two-stage manner: extract candidates from all passages, then focus on these candidate answers and rerank them to get a final answer. RankVQA (Qiao et al., 2020) introduces the answer re-ranking method to the VQA task. They design an auxiliary task which reranks candidate answers according to their matching degrees with the input image and off-line generated image captions. However, RankVQA still predicts the final answer from the huge prediction space rather than from selected candidate answers.
Figure 2 shows an overview of the proposed select-and-rerank (SAR) framework, which consists of a Candidate Answer Selecting module and an Answer Re-ranking module. In the Candidate Answer Selecting module, given an image and a question, we first use a current VQA model to get a candidate answer set consisting of the top N answers. In this module, the answers irrelevant to the question can be filtered out. Next, we formulate VQA as a VE task in the Answer Re-ranking module, where the image is the premise and the synthetic dense caption (Johnson et al., 2016) (the combination of the answer and the question) is the hypothesis. We use the cross-domain pre-trained model LXMERT (Tan and Bansal, 2019) as the VE scorer to compute the entailment score of each image-caption pair, and the answer corresponding to the dense caption with the highest score is our final prediction.

Candidate Answer Selecting
The Candidate Answer Selector (CAS) selects several answers from all possible answers as candidates and thus shrinks the huge prediction space.
Consider a VQA dataset $\{I_i, Q_i\}_{i=1}^{M}$ with $M$ samples, where $I_i \in I$ and $Q_i \in Q$ are the image and question of the $i$-th sample and $A$ is the whole prediction space consisting of thousands of answer categories. Essentially, the VQA model applied as CAS is an $|A|$-class classifier, and is a free choice in our framework. Given an image $I_i$ and a question $Q_i$, CAS first gives the regression scores over all optional answers: $P(A \mid Q_i, I_i)$. Then CAS chooses the $N$ answers $A_i^*$ with the top $N$ scores as candidates, which is summarized as follows:
$$A_i^* = \mathrm{topN}\big(P(A \mid Q_i, I_i)\big)$$
The candidates $A_i^*$ are passed to the next Answer Re-ranking module. In this paper, we mainly use SSL as our CAS. We also conduct experiments to analyze the impact of different CAS and different $N$.
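The selection step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `select_candidates`, the toy answer vocabulary and the scores are all assumptions made for the example, and any VQA model that outputs one score per answer could play the role of the CAS.

```python
import numpy as np

def select_candidates(scores, n):
    """Return the indices of the n highest-scoring answers, best first.

    `scores` stands in for P(A | Q_i, I_i): one score per answer in A.
    """
    scores = np.asarray(scores, dtype=float)
    # Sort in descending order of score; a stable sort keeps ties in
    # their original vocabulary order.
    order = np.argsort(-scores, kind="stable")
    return order[:n]

# Hypothetical toy prediction space and CAS scores for one (image, question).
answer_vocab = ["yellow", "black", "2", "yes", "no"]
cas_scores = [0.40, 0.35, 0.05, 0.15, 0.05]

candidates = [answer_vocab[j] for j in select_candidates(cas_scores, 2)]
print(candidates)
```

With N = 2, only "yellow" and "black" survive, so the re-ranking module sees two hypotheses instead of the whole answer vocabulary.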

Visual Entailment
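The Visual Entailment (VE) task was proposed by Xie et al. (2019), where the premise is a real-world image, denoted by $P_{image}$, and the hypothesis is a text, denoted by $H_{text}$. Given a sample $(P_{image}, H_{text})$, the goal of the VE task is to determine whether $H_{text}$ can be concluded based on the information in $P_{image}$. According to the protocol of Xie et al. (2019), the sample is assigned one of three labels: Entailment, if there is enough evidence in $P_{image}$ to conclude that $H_{text}$ is true; Contradiction, if there is enough evidence in $P_{image}$ to conclude that $H_{text}$ is false; and Neutral otherwise.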

VQA As Visual Entailment
A question $Q_i$ and each of its candidate answers in $A_i^*$ can be bridged into a complete statement, and the image can then verify the authenticity of each statement. More specifically, the visual presence of concepts (e.g. "black bat"/"yellow bat") based on the combination of the question and the correct/wrong candidate answer can be entailed/contradicted by the content of the image. In this way, we achieve better interaction among question, image and answer.
Therefore, we formulate VQA as a VE problem, in which the image $I_i$ is the premise, and the synthetic statement of an answer $A_i^n$ in $A_i^*$ and question $Q_i$, represented as $(Q_i, A_i^n)$, is the hypothesis. For an image, synthetic statements of different questions describe different regions of the same image. Following Johnson et al. (2016), we also refer to the synthetic statement as a "dense caption".
Note that there is no Neutral label in our VE task; we only have two labels: Entailment and Contradiction.

Re-Ranking based on VE
We re-rank dense captions by contrastive learning, that is, the statement built from a correct answer should be more semantically similar to the image than the statement built from a wrong answer. The right part of Figure 2 illustrates this idea. The more semantically similar $I_i$ is to $(Q_i, A_i^n)$, the deeper the visual entailment degree is. We score the visual entailment degree of $I_i$ to each $(Q_i, A_i^n)$ and rerank the candidate answers $A_i^*$ by this score. The top-ranked answer is our final output.

Question-Answer Combination Strategy
The answer information makes sense only when it is combined with the question. We encode the combination of question and answer text to obtain the joint concept.
We design three question-answer combination strategies, R, C and R→C, to combine a question and an answer into a synthetic dense caption $C_i$:
R: Replace question category prefix with answer. The prefix of each question is the question category, such as "are there", "what color", etc. For instance, given the question "How many flowers in the vase?", its answer "8" and its question category "how many", the resulting dense caption is "8 flowers in the vase". Similarly, "No a crosswalk" is the result of the question "Is this a crosswalk?" and the answer "No". We build a dictionary of all question categories of the train set, then adopt a Forward Maximum Matching algorithm to determine the question category for every test sample.
C: Concatenate question and answer directly. For two cases above, the resulting dense captions are "8 How many flowers in the vase?" and "No Is this a crosswalk?". The resulting dense captions after concatenation are actually rhetorical questions.
We deliberately add answer text to the front of question text in order to avoid the answer being deleted when trimming dense captions to the same length.

R→C: We first use strategy R at training, which is aimed at preventing the model from excessively focusing on the co-occurrence relation between question category and answer, and then use strategy C at testing to introduce more information for inference.
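The three strategies can be illustrated with a short Python sketch. It is a simplified reading of the description above, not the authors' code: the helper names, the tiny category dictionary and the character-level longest-prefix search (used here as a stand-in for Forward Maximum Matching) are assumptions made for illustration.

```python
def find_category(question, categories):
    """Longest question category that is a prefix of the question.

    A simplified, character-level form of forward maximum matching.
    """
    q = question.lower()
    best = ""
    for cat in categories:
        if q.startswith(cat) and len(cat) > len(best):
            best = cat
    return best

def combine_r(question, answer, categories):
    """Strategy R: replace the question category prefix with the answer."""
    cat = find_category(question, categories)
    rest = question[len(cat):].strip().rstrip("?").strip()
    return f"{answer} {rest}".strip()

def combine_c(question, answer):
    """Strategy C: put the answer in front of the unchanged question."""
    return f"{answer} {question}"

def build_caption(question, answer, categories, strategy, training):
    """Dispatch on strategy; R->C uses R for training and C for testing."""
    if strategy not in ("R", "C", "R->C"):
        raise ValueError(f"unknown strategy: {strategy}")
    if strategy == "R" or (strategy == "R->C" and training):
        return combine_r(question, answer, categories)
    return combine_c(question, answer)

# Hypothetical excerpt of the question-category dictionary.
categories = ["how", "how many", "is", "is this", "what color", "are there"]

print(combine_r("How many flowers in the vase?", "8", categories))
print(combine_c("Is this a crosswalk?", "No"))
```

Placing the answer first in strategy C mirrors the remark above that a front-loaded answer cannot be lost when captions are trimmed to a fixed length.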
Adopting any strategy above, we combine $Q_i$ and each answer in $A_i^*$ to derive the dense captions $C_i^*$. We thus have a dataset $D = \{I_i, C_i^n\}_{i=1,n=1}^{M,N}$ with $M \times N$ instances for the VE task.
VE Scorer We use the pre-trained model LXMERT to score the visual entailment degree of $(I_i, C_i^n)$. LXMERT separately encodes the image and the caption text in two streams. Next, the separate streams interact through co-attentional transformer layers. In the textual stream, the dense caption is encoded into a high-level concept. Then the visual representations from the visual stream can verify the visual presence/absence of the high-level concept.
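We represent the VE score for the $i$-th image and its $n$-th candidate caption as:
$$\delta(\mathrm{Trm}(I_i, C_i^n))$$
where $\mathrm{Trm}()$ is the 1-dimensional output from the dense layers following LXMERT and $\delta()$ denotes the sigmoid function. A larger score represents a higher entailment degree. We optimize parameters by minimizing the multi-label soft loss:
$$L_{VE} = -\frac{1}{M N} \sum_{i=1}^{M} \sum_{n=1}^{N} \left[ t_i^n \log\left(\delta(\mathrm{Trm}(I_i, C_i^n))\right) + (1 - t_i^n) \log\left(1 - \delta(\mathrm{Trm}(I_i, C_i^n))\right) \right]$$
where $t_i^n$ is the soft target score of the $n$-th answer.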
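To make the scoring and training objective concrete, the following NumPy sketch computes sigmoid scores and a soft-target binary cross-entropy over a batch of image-caption pairs. It assumes the usual two-term form of the multi-label soft loss (a term for the target and a term for its complement); the function names and the example logits are hypothetical, and the logits stand in for the scalar output of the dense layers that follow LXMERT.

```python
import numpy as np

def sigmoid(x):
    """Element-wise logistic function (the delta in the text)."""
    return 1.0 / (1.0 + np.exp(-x))

def ve_loss(logits, targets, eps=1e-12):
    """Mean soft-target binary cross-entropy over all (image, caption) pairs.

    logits  : array of shape (M, N), raw scorer outputs Trm(I_i, C_i^n)
    targets : array of shape (M, N), soft target scores t_i^n in [0, 1]
    """
    p = sigmoid(np.asarray(logits, dtype=float))
    p = np.clip(p, eps, 1.0 - eps)  # keep log() away from 0
    t = np.asarray(targets, dtype=float)
    return float(-np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)))

# One image with three candidate captions (hypothetical numbers).
logits = [[2.0, -1.0, -3.0]]
targets = [[1.0, 0.3, 0.0]]
print(ve_loss(logits, targets))
```

Because the targets are soft scores rather than hard 0/1 labels, a caption whose answer was given by only some annotators still contributes a partial positive signal.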

Combination with Language-Priors Method
After Candidate Answer Selecting, the number of candidate answers decreases from all possible answers to the top $N$. Although some unrelated answers are filtered out, the dataset $D$ for the VE system is still biased. Therefore, we can optionally apply existing language-priors methods to our framework to further reduce language priors. Taking SSL as an example, we apply the loss function of its self-supervised task to our framework by adjusting the loss function to:
$$L_{ssl} = \frac{\alpha}{M N} \sum_{i=1}^{M} \sum_{n=1}^{N} P(I'_i, C_i^n)$$
where $(I'_i, C_i^n)$ denotes the irrelevant image-caption pairs and $\alpha$ is a down-weighting coefficient.
The probability $P(I'_i, C_i^n)$ can be considered as the confidence of $(I'_i, C_i^n)$ being a relevant pair. We can reformulate the overall loss function as:
$$L = L_{VE} + L_{ssl}$$
where $L_{VE}$ denotes the multi-label soft loss of the VE scorer.
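One plausible reading of this combination, sketched in Python: the self-supervised term is the down-weighted mean confidence that the scorer assigns to irrelevant image-caption pairs, and it is added to the VE loss. The additive form, the function names and the numbers below (including the value of α) are assumptions for illustration only.

```python
import numpy as np

def ssl_term(irrelevant_probs, alpha):
    """Down-weighted mean confidence on irrelevant (image, caption) pairs.

    Minimizing this pushes the scorer to give low relevance scores to
    captions paired with an image they do not belong to.
    """
    return alpha * float(np.mean(irrelevant_probs))

def overall_loss(ve_loss_value, irrelevant_probs, alpha):
    """Assumed additive combination of the VE loss and the SSL term."""
    return ve_loss_value + ssl_term(irrelevant_probs, alpha)

# Hypothetical values: a VE loss of 0.5 and two irrelevant pairs that the
# scorer currently rates as 20% and 40% likely to be relevant.
print(overall_loss(0.5, [0.2, 0.4], alpha=2.0))
```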

Inference Process
Question Type Discriminator Intuitively, most "Yes/No" questions can be answered by the answer "Yes" or "No". There is no need to provide too many candidate answers for "Yes/No" questions at the test stage. Therefore, we propose a Question Type Discriminator (QTD) to determine the question type and then correspondingly set different numbers of candidate answers, denoted as N. Specifically, we roughly divide question types (including "Yes/No", "Num" and "Other") into yes/no and non-yes/no. A GRU binary classifier is trained with cross-entropy loss and evaluated with 5-fold cross-validation on the train split of each dataset. Then, the trained QTD model, with an accuracy of about 97%, is used as an off-line module during the test stage. We will further investigate the effect of N on each question type in the next section.

Final Prediction
In the inference phase, we search for the best dense caption $\hat{C}_i$ among all candidates $C_i^*$ for the $i$-th image:
$$\hat{C}_i = \arg\max_{n} \delta(\mathrm{Trm}(I_i, C_i^n))$$
The answer $\hat{A}_i$ corresponding to $\hat{C}_i$ is the final prediction.
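The inference procedure can be sketched as follows. This is an illustrative outline under stated assumptions: the candidates arrive in CAS order (best first), the yes/no decision comes from an already trained QTD, and the default values of N (2 and 10) are merely picked from the test-time ranges reported for VQA-CP v2. The function name and the example scores are hypothetical.

```python
def final_answer(candidates, ve_scores, is_yes_no, n_yes_no=2, n_other=10):
    """Pick the candidate whose dense caption has the highest VE score.

    candidates : answers in CAS order, best first
    ve_scores  : VE score of the dense caption built from each candidate
    is_yes_no  : output of the question type discriminator
    """
    # The question type decides how many candidates are considered.
    n = n_yes_no if is_yes_no else n_other
    pool = list(zip(candidates, ve_scores))[:n]
    # arg max over the entailment scores of the kept candidates.
    best_answer, _ = max(pool, key=lambda pair: pair[1])
    return best_answer

# "What color is the bat?" with hypothetical scores.
print(final_answer(["yellow", "black", "brown"], [0.20, 0.90, 0.10],
                   is_yes_no=False))
```

The CAS ranks "yellow" first here, but the entailment score promotes "black", which is the second chance that the re-ranking stage is meant to provide.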

Setting
Datasets Our models are trained and evaluated on the VQA-CP v2 dataset, which is crafted by re-organizing the VQA v2 (Goyal et al., 2017) training and validation sets such that answers for each question category (65 categories according to the question prefix) have different distributions in the train and test sets. Therefore, VQA-CP v2 is a natural choice for evaluating a VQA model's generalizability. The questions of VQA-CP v2 include 3 types: "Yes/No", "Num" and "Other". Note that the question type and the question category (e.g. "what color") are different. Besides, we also evaluate our models on the VQA v2 validation set for completeness, and compare the accuracy difference between the two datasets with the standard VQA evaluation metric (Antol et al., 2015).
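For reference, the standard VQA metric credits a predicted answer in proportion to how many human annotators gave it, saturating at three. The sketch below uses the commonly quoted simplified formula, min(#matching humans / 3, 1); the official evaluation script additionally normalizes answer strings and averages over annotator subsets, which is omitted here. The function name and the example annotations are hypothetical.

```python
def vqa_accuracy(predicted, human_answers):
    """Simplified VQA accuracy for a single question.

    An answer counts as fully correct if at least 3 annotators gave it.
    """
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)

# Ten hypothetical annotations for "What color is the bat?"
humans = ["black"] * 7 + ["dark"] * 2 + ["brown"]
print(vqa_accuracy("black", humans))
print(vqa_accuracy("dark", humans))
```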
Baselines We compare our method with the following baseline methods: UpDn (

Implementation Details
In this paper, we mainly choose SSL as our CAS and set N=12 and N=20 for training. To extract image features, we follow previous work and use the pre-trained Faster R-CNN to encode each image as a fixed set of 36 objects with 2048-dimensional feature vectors. We use the tokenizer of LXMERT to segment each dense caption into words. All the questions are trimmed to the same length, 15 for the R strategy and 18 for the C strategy. In the Answer Re-ranking Module, we incorporate two language-priors methods, SSL and LMH, into our proposed framework SAR; the resulting models are dubbed SAR+SSL and SAR+LMH. Our models are trained on two TITAN RTX 24GB GPUs. We train SAR+SSL for 20 epochs with a batch size of 32, and SAR and SAR+LMH for 10 epochs with a batch size of 64. For SAR+SSL, we follow the same setting as the original paper, except that we do not need to pre-train the model with the VQA loss before fine-tuning it with the self-supervised loss. The Adam optimizer is adopted with a learning rate of 1e-5. For the Question Type Discriminator, we use 300-dimensional GloVe (Pennington et al., 2014) vectors to initialize word embeddings and feed them into a unidirectional GRU with 128 hidden units. When testing on VQA-CP v2, N ranges from 1-2 for yes/no questions and 5-15 for non-yes/no questions. As for VQA v2, N ranges from 1-2 for yes/no questions and 2-5 for non-yes/no questions.

Main Results
Performance on the two benchmarks VQA-CP v2 and VQA v2 is shown in Table 1. We report the best results of SAR, SAR+SSL and SAR+LMH among the 3 question-answer combination strategies respectively. "TopN-" indicates that N candidate answers (selected by CAS) are fed into the Answer Re-ranking Module for training. Our approach is evaluated with two settings of N (12 and 20). From the results on VQA-CP v2 shown in Table 1, we can observe that: (1) Top20-SAR+LMH establishes a new state-of-the-art accuracy of 66.73% on VQA-CP v2, beating the previous best-performing method CL by 7.55%. Even without combining language-priors methods in the Answer Re-ranking module, our model Top20-SAR outperforms CL by 6.26%. These results show the outstanding effectiveness of our proposed SAR framework.
(2) SAR+SSL and SAR+LMH achieve much better performance than SSL and LMH, which demonstrates that SAR is compatible with current language-priors methods and can realize their full potential. (3) Compared with another reranking-based model, RankVQA, our method improves performance by a large margin of 23.68%. This shows the superiority of our proposed progressive select-and-rerank framework over RankVQA, which only uses answer reranking as an auxiliary task. (4) Previous models did not generalize well on all question types. CL is the previous best on the "Yes/No" and "Num" questions, and LXMERT on the "Other" questions. In comparison, our model not only rivals the previous best model on the "Yes/No" questions but also improves the best performance on the "Num" and "Other" questions by 12.45% and 3.65%. The remarkable performance on all question types demonstrates that our model makes significant progress toward a truly comprehensive VQA model.
We also evaluate our method on VQA v2, which is deemed to have strong language biases. As shown in Table 1, our method achieves the best accuracy of 70.63% amongst baselines specially designed for overcoming language priors, and is the closest to the SOTA established by LXMERT, which is trained explicitly for the biased data setting. For completeness, the performance gap between the two datasets is also compared in Table 1 with the protocol from Chen et al. (2020a). Compared with most previous models, which suffer severe performance drops between VQA v2 and VQA-CP v2 (e.g., 27.93% in LXMERT), Top20-SAR+LMH decreases the performance drop to 2.49%, which demonstrates the effectiveness of our framework in further overcoming language biases. Though CSS achieves a better performance gap, it sacrifices performance on VQA v2. Meanwhile, as N rises from 12 to 20, our models achieve better accuracy on both datasets along with a smaller performance gap. This demonstrates that, unlike previous methods, our method can alleviate language priors while maintaining an excellent capability of answering questions. Nonetheless, we believe that how to improve the model's generality and further transform the trade-off between eliminating language priors and answering questions into a win-win outcome is a promising research direction for the future.

The Effect of N
From Figure 3, we can observe that the overall performance is getting better as N increases. The performance improvement on the "Num" and "Other" questions is especially obvious, and there is a very slight drop on the "Yes/No" questions. We believe that SAR can further get better performance by properly increasing N . Due to the resource limitation, the largest N we use is 20 in this paper.

The Effect of Different CAS
To find out the potential performance limitation of CAS models, we show the accuracy of 3 CAS models on the VQA-CP v2 test set. As shown in Figure 1 (a), the Top3 accuracy (acc) of 3 models is about 70% and Top6 acc is 80%, which guarantees that sufficient correct answers are recalled by CAS. And thus, the performance limitation of CAS is negligible.
We also conduct experiments to investigate the effect of different CAS on SAR. From the results shown in Table 2, we can observe that: (1) Choosing a better VQA model as CAS does not guarantee a better performance, e.g. performance based on UpDn outperforms that based on LMH, but LMH is a better VQA model in overcoming language priors compared with UpDn. This is because a good Candidate Answer Selector has two requirements: (a) It should be able to recall more correct answers. (b) Under the scenario of language biases, wrong answers recalled by CAS at training time should have superficial correlations with the question that are as strong as possible. However, the ensemble methods, such as LMH, are trained to pay more attention to the samples which are not correctly answered by the question-only model. This seriously reduces the recall rate of those language-priors wrong answers, which makes the training data for VE too simple and thus hurts the model's capability of reducing language priors. (2) If CAS is the general VQA model UpDn rather than LMH or SSL, the improvement brought by the combination with a language-priors method in the Answer Re-ranking module is more obvious. (3) Even if we choose UpDn, a backbone model of most current works, as our CAS and do not involve any language-priors methods, SAR still achieves a much better accuracy than the previous SOTA model CL, by 2.53%, which shows that our basic framework already possesses outstanding capability of reducing language priors.
Table 3: Results on the VQA-CP v2 test set based on different question-answer combination strategies: R, C and R→C. The major difference between R and C is whether the question prefix, which covers 65 categories, is kept.

The Effect of Question-Answer Combination Strategies
From the results shown in Table 3, we can observe that: (1) … the information of question category is useful in inference. (2) On SAR and SAR+SSL, C consistently outperforms R, but on SAR+LMH, we see the opposite result. This is probably because our method and the data-balancing method SSL can learn the positive bias resulting from the superficial correlations between question category and answer, which is useful for generalization, whereas the ensemble-based method LMH attenuates positive bias during the de-biasing process. (3) Even without a language-priors method, SAR with R→C rivals or outperforms SAR+SSL and SAR+LMH with R or C, which shows that the R→C strategy helps the model to alleviate language priors. As a result, compared with R or C, our framework with R→C gains only a slight performance improvement after using the same language-priors methods.

Ablation Study
"CAS+" represents we use the select-and-rerank framework.
From Table 4, we can find that: (1) LXM+SSL represents directly applying SSL to LXMERT. Its poor performance shows that the major contribution of our framework does not come from the combination of the language-priors method SSL and pre-trained model LXMERT.
(2) Compared with LXM and LXM+SSL, CAS+LXM and CAS+LXM+SSL respectively gain prominent performance boosts of 9.35% and 6.32%, which demonstrates the importance and effectiveness of our proposed select-and-rerank procedure. (3) CAS+LXM+QTD(R) and CAS+LXM+SSL+QTD(R) respectively outperform CAS+LXM(R) and CAS+LXM+SSL(R) by 3.93% and 2.71%, which shows the contribution of the QTD module. This further demonstrates that choosing an appropriate N for different question types is a useful step for model performance. (4) CAS+LXM+SSL+QTD improves the performance of CAS+LXM+QTD by 2.61%, which shows that current language-priors methods fit our framework well and can further improve performance.

The Effect of N
From Figure 4, we can find that: (1) The best N for yes/no questions is smaller than that for non-yes/no questions due to the nature of yes/no questions. (2) As N increases, the accuracy on "Num" and "Other" questions rises first and then decreases. There is a trade-off behind this phenomenon: when N is too small, the correct answer may not be recalled by CAS; when N is too large, the distraction from wrong answers makes it more difficult for the model to choose the correct answer.

Qualitative Examples
We qualitatively evaluate the effectiveness of our framework. As shown in Figure 5, compared with SSL, SAR performs better not only in question answering but also in visual grounding. With the help of answer semantics, SAR can focus on the region relevant to the candidate answer and further use the region to verify its correctness.

Conclusion
In this paper, we propose a select-and-rerank (SAR) progressive framework based on Visual Entailment. Specifically, we first select candidate answers to shrink the prediction space, then we rerank the candidate answers by a visual entailment task which verifies whether the image semantically entails the synthetic statement of the question and each candidate answer. Our framework can make full use of the interactive information of image, question and candidate answers. In addition, it is a generic framework, which can be easily combined with existing VQA models and further boost their abilities. We demonstrate the advantages of our framework on the VQA-CP v2 dataset with extensive experiments and analyses. Our method establishes a new state-of-the-art accuracy of 66.73%, an improvement of 7.55% over the previous best.