Digging out Discrimination Information from Generated Samples for Robust Visual Question Answering



Introduction
With the vigorous development of computer vision and natural language processing, the vision-and-language field (Gu et al., 2022; Wen et al., 2023b) has taken a step forward. As a typical vision-and-language task, Visual Question Answering (VQA) (Anderson et al., 2018; Cadène et al., 2019a) requires an agent to fully comprehend the information in a question and an image, and then correctly answer the textual question according to the image. Although recent advances (Cadène et al., 2019a) have achieved impressive performance on benchmark datasets (e.g., VQA v2 (Goyal et al., 2017)), numerous studies (Agrawal et al., 2018; Kafle and Kanan, 2017) have shown that some VQA models rely excessively on superficial correlations (i.e., biases) between questions and answers, instead of reasoning to answer the questions. For example, a VQA model that simply answers "2" and "tennis" for the questions "How many . . ." and "What sports . . .", respectively, would obtain high accuracy, since the answers "2" and "tennis" are the most frequent for those question types in the dataset. However, memorising biases to answer questions signifies flawed reasoning ability, resulting in poor generalisation.
To mitigate the bias issues, many methods have been proposed, which can be roughly categorised into three types: 1) enhancing visual attention (Selvaraju et al., 2019; Wu and Mooney, 2019), 2) directly weakening the biases (Cadène et al., 2019b; Niu et al., 2021), and 3) balancing the dataset (Chen et al., 2020; Zhu et al., 2020). Previous studies have shown that methods that balance the dataset usually outperform the other types, since they dig out the natural distribution of the data and then devise a suitable strategy to overcome the biases. Specifically, the CSS (Chen et al., 2020) and Mutant (Gokhale et al., 2020) methods generate counterfactual samples by masking critical objects in the images or critical words in the questions. However, these methods require additional annotations that are hard to obtain. To get rid of the dependence on additional annotations, MMBS (Si et al., 2022) constructs positive questions by randomly shuffling the question words or removing the question-type words, which destroys the grammar and semantics of the original questions. Moreover, SimpleAug (Kil et al., 2021) and KD-DAug (Chen et al., 2022) build new samples by re-combining existing questions and images, but it is difficult to assign correct answers to the generated samples. SSL-VQA (Zhu et al., 2020) and D-VQA (Wen et al., 2021) construct negative samples by randomly sampling images or questions within a mini-batch. Nevertheless, these methods consider negative samples only, ignoring that generated positive samples could improve the diversity of the dataset and further promote the robustness of VQA models.
To overcome the above issues, we propose a method to Dig out Discrimination information from Generated samples (DDG). As pointed out by (Wen et al., 2021), bias issues exist in both the vision and language modalities; we thus construct positive and negative samples in both modalities and devise corresponding training objectives to achieve unbiased learning. Concretely, we feed each sample to an UpDn (Anderson et al., 2018) model pre-trained on the VQA-CP v2 (Agrawal et al., 2018) training set, and select the k objects with the top-k image attention weights of UpDn as the positive image. The positive questions are constructed with a translate-and-back-translate mechanism, e.g., English → French → English. We then combine the positive images and positive questions with the original questions and original images, respectively, to form positive image and question samples. Based on the positive samples, we adopt a knowledge distillation mechanism (Hinton et al., 2015) to help the learning of the original samples.
Moreover, inspired by (Wen et al., 2021), we construct mismatched image-question pairs as negative samples. Generally speaking, one cannot answer a question correctly given a mismatched image-question pair, since the supporting modality information is missing. To encourage VQA models to focus on both the vision and language modalities, we devise a training objective that minimises the likelihood of predicting the ground-truth answer of an original sample when given the corresponding negative samples. Besides, we further introduce the corresponding positive samples to assist the training. Based on the above debiasing techniques, our DDG achieves impressive performance on the VQA-CP v2 (Agrawal et al., 2018) and VQA v2 (Goyal et al., 2017) datasets, which demonstrates its effectiveness.
Our contributions can be summarised as follows: 1) We devise a novel positive image sample generation strategy that uses the image attention weights of a pre-trained UpDn model to guide the selection of the target objects. 2) We introduce a knowledge distillation mechanism so that the positive samples promote the learning of the original samples. 3) We adopt the positive and negative samples to impel VQA models to focus on both the vision and language modalities, mitigating the biases.

Overcoming biases in VQA
Recently, researchers have proposed various debiasing techniques (Selvaraju et al., 2019; Niu et al., 2021; Zhu et al., 2020; Wen et al., 2023a) to alleviate the bias issues in VQA, which can be roughly categorised into three types: 1) enhancing visual attention, 2) directly weakening the biases, and 3) balancing the dataset.

Methods that enhance visual attention. These methods adopt human-annotated information to strengthen the visual attention of VQA models. Specifically, Selvaraju et al. (2019) aligned important image regions identified from gradients with human attention maps to enhance visual attention in VQA models. Wu and Mooney (2019) introduced a self-critical training objective that matches the ground-truth answer with the most important image region recognised from human explanations. However, these methods require human annotations that are hard to obtain.

Methods that weaken the biases. Ramakrishnan et al. (2018) adopted adversarial learning to inhibit VQA models from capturing language biases. Inspired by (Ramakrishnan et al., 2018), Cadène et al. (2019b) devised a question-only model to generate weights to re-weight the samples. Moreover, Han et al. (2021) forced biased models to capture different types of biases, and removed them step by step. Differently, Niu et al. (Niu et al., 2021; Niu and Zhang, 2021) introduced the idea of cause-effect to help alleviate the biases. Nevertheless, these methods introduce additional parameters in the training or inference phases.

Methods that balance the dataset. The CSS (Chen et al., 2020) and Mutant (Gokhale et al., 2020) methods generate massive counterfactual samples by masking critical objects in the images and critical words in the questions, respectively. However, these methods require additional annotations to assign answers to the generated samples. To get rid of this dependence, KD-DAug (Chen et al., 2022) and SimpleAug (Kil et al., 2021) construct samples by re-composing existing questions and images, for which, however, it is hard to assign correct answers. SSL-VQA (Zhu et al., 2020) and D-VQA (Wen et al., 2021) construct negative samples by randomly sampling images or questions within a mini-batch. Nevertheless, these methods ignore that positive samples could improve the diversity of the dataset, which helps improve the robustness of VQA models. Moreover, Si et al. (2022) construct positive question samples by randomly shuffling the words or removing the question-type words, which destroys the semantics of the original questions.
Different from the above methods, we construct positive and negative samples in both the vision and language modalities, and devise corresponding debiasing strategies to achieve unbiased learning.

Knowledge Distillation
Knowledge Distillation (KD) (Hinton et al., 2015) is a universal model compression method that trains a small student model under the guidance of a large teacher model. Due to its effectiveness, the idea has been applied to other tasks, e.g., long-tail classification (He et al., 2021; Xiang et al., 2020), object detection (Chen et al., 2017; Wang et al., 2019), and video captioning (Pan et al., 2020; Zhang et al., 2020). Recently, some debiased VQA methods (Niu and Zhang, 2021; Chen et al., 2022) introduced KD to alleviate the bias issues. Specifically, Niu and Zhang (2021) devised two teachers (i.e., an ID-teacher and an OOD-teacher) to generate "soft" labels that guide the training of the student model (i.e., the baseline model) via the KD mechanism. Inspired by IntroD, Chen et al. (2022) adopted a multi-teacher KD mechanism to generate robust pseudo labels for all newly composed image-question pairs. In our DDG, we seek to improve the reasoning ability of VQA models with the help of the generated positive samples via the KD mechanism.

Digging out Discrimination Information from Generated Samples
As shown by (Agrawal et al., 2018; Kafle and Kanan, 2017), VQA models tend to capture the biases in a dataset to answer questions, instead of adopting reasoning ability, resulting in poor generalisation. Moreover, bias issues exist in both the vision and language modalities (Wen et al., 2021). To address these issues, we construct both positive and negative samples in the vision and language modalities, and devise corresponding debiasing strategies to achieve unbiased learning. The overall framework is shown in Figure 1.

Preliminary
Visual question answering (VQA) requires an agent to answer a textual question given a corresponding image. Traditional VQA methods (Anderson et al., 2018; Kim et al., 2018; Ben-younes et al., 2019; Cadène et al., 2019a) regard the VQA task as a multi-class classification problem, where each class corresponds to a unique answer. To be specific, given a VQA dataset D with N samples, where v_i ∈ V (image set) and q_i ∈ Q (question set) form the i-th sample in D, and a_i ∈ A (answer set) is the corresponding ground-truth answer, VQA methods seek to learn a multimodal mapping

f_vqa: V × Q → R^{|A|}   (1)

to generate an answer distribution over the answer set A. Generally speaking, most VQA models contain four parts, namely a vision feature encoder e_v(·), a language feature encoder e_q(·), a multimodal feature fusion module f(·,·), and a classifier c(·). These modules can be composed into a traditional VQA model:

P(A|v_i, q_i) = c(f(e_v(v_i), e_q(q_i))).

Since the VQA task is regarded as a multi-class classification problem, VQA models can be optimised by a binary cross-entropy loss L_vqa, which can be formulated as

L_vqa = -(1/N) Σ_{i=1}^{N} [ ā_i log σ(P(A|v_i, q_i)) + (1 - ā_i) log(1 - σ(P(A|v_i, q_i))) ],   (2)

where σ denotes the sigmoid activation function, and ā_i is the target score obtained from the answer a_i that humans annotated for (v_i, q_i).
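The four-module composition and the binary cross-entropy objective above can be sketched as follows. This is a toy stand-in, not the actual UpDn architecture: each module is a fixed random linear map, and the feature dimensions (2048-d objects, 300-d words, 10 candidate answers) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the four modules: vision encoder e_v, question
# encoder e_q, fusion module f, and classifier c. Real models would
# be neural networks; here each is a fixed random linear map.
W_v = rng.normal(size=(2048, 512))
W_q = rng.normal(size=(300, 512))
W_c = rng.normal(size=(512, 10))        # 10 candidate answers

def e_v(v):                              # vision feature encoder
    return v @ W_v

def e_q(q):                              # language feature encoder
    return q @ W_q

def f(hv, hq):                           # multimodal fusion (elementwise product)
    return hv * hq

def c(h):                                # classifier over the answer set A
    return h @ W_c

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(logits, target, eps=1e-9):
    """Binary cross-entropy over the answer set, as in Eq. (2)."""
    p = sigmoid(logits)
    return -np.mean(target * np.log(p + eps) + (1 - target) * np.log(1 - p + eps))

v = rng.normal(size=(2048,))             # one image feature
q = rng.normal(size=(300,))              # one question feature
target = np.zeros(10)
target[3] = 1.0                          # human-annotated answer score

logits = c(f(e_v(v), e_q(q)))            # P(A|v_i, q_i) before the sigmoid
loss = bce_loss(logits, target)
print(logits.shape, loss > 0)
```

The fusion operator, dimensions, and answer-set size are all assumptions for illustration; any differentiable fusion and classifier fit the same template.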

Sample Generation
Our method aims to adopt the generated samples to achieve unbiased learning.Hence we present how to generate the positive and negative samples at first.As pointed out by (Wen et al., 2021), biases exist in both language and vision modalities.To overcome the bias issue, we seek to generate positive and negative samples regarding the vision and language modalities for each original sample, to assist the training process.

Figure 1: Overview of our DDG method. We draw support from the pre-trained UpDn (Anderson et al., 2018) model and a translate-and-back-translate mechanism to generate the positive image and question samples, respectively, and then introduce a knowledge distillation mechanism to facilitate the learning of the original samples. Moreover, we construct mismatched image-question pairs as negative samples by randomly sampling images or questions within a mini-batch. Using these negative samples, we enhance the attention of VQA models towards both the vision and language modalities when answering the questions, thereby mitigating biases.
Positive sample generation. To mitigate the bias issue over the vision and language modalities in VQA, we build two types of positive samples, i.e., a positive question sample and a positive image sample. To generate the positive image samples, we draw support from the image attention weights of a pre-trained baseline VQA model. We empirically find that although baseline models (e.g., UpDn (Anderson et al., 2018)) achieve unsatisfactory performance on out-of-distribution (OOD) test sets (e.g., the VQA-CP v2 (Agrawal et al., 2018) dataset), they still obtain promising performance on independent and identically distributed (IID) datasets (e.g., the VQA v2 dataset (Goyal et al., 2017)). In other words, during training the baseline models can identify the target objects in the images referred to by the questions, regardless of whether they capture biases. Hence, the image attention weights of the pre-trained UpDn (Anderson et al., 2018) model can help find the target objects for positive image samples, excluding the background information of the images.
Given a sample (v_i, q_i) from the VQA-CP v2 training set, we first feed it to the UpDn model pre-trained on the VQA-CP v2 training set and obtain its image attention weights over the objects in image v_i. We then select the k objects with the top-k attention weights as the positive image sample, where k is a hyper-parameter.
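The top-k selection step can be sketched as below. The object features and attention weights are toy arrays; in the actual method they come from the pre-trained UpDn model.

```python
import numpy as np

def select_positive_objects(object_feats, attn_weights, k):
    """Keep the k objects with the highest attention weights as the
    positive image; everything else (mostly background) is dropped."""
    topk_idx = np.argsort(attn_weights)[::-1][:k]
    return object_feats[np.sort(topk_idx)]   # preserve original object order

feats = np.arange(36 * 4).reshape(36, 4).astype(float)  # 36 objects, dim 4
attn = np.linspace(0.0, 1.0, 36)                        # toy attention weights
pos = select_positive_objects(feats, attn, k=10)
print(pos.shape)   # (10, 4)
```

With monotonically increasing toy weights, the function keeps the last 10 objects; with real attention weights it keeps whichever objects the model attended to most.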
To generate positive question samples, previous methods (Si et al., 2022) adopt data augmentation, e.g., randomly shuffling the question words or removing the question-category words. However, these operations severely destroy the grammar and semantics of the original questions. To mitigate this issue, inspired by (Tang et al., 2020), we adopt a translate-and-back-translate mechanism to generate positive question samples. Specifically, we first use pre-trained English-to-French and English-to-German translation models to translate the original question into French and German, respectively. Then we use the corresponding pre-trained back-translation models to translate them back into English. Moreover, we adopt a pre-trained sentence similarity model to choose the back-translated question with the highest similarity score to the original question as the positive question sample. Note that some simple questions remain unchanged even after the translate-and-back-translate process. To generate positive question samples for these questions, we substitute words in the question with synonyms based on a pre-trained synonym word substitution model.
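The candidate-selection step can be sketched as follows. Here a bag-of-words cosine is a toy stand-in for the pre-trained sentence similarity model, and the back-translated candidates are hypothetical outputs of the French and German round trips.

```python
from collections import Counter
import math

def bow_cosine(a, b):
    """Toy stand-in for the pre-trained sentence-similarity model:
    cosine similarity over bag-of-words counts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def pick_positive_question(original, back_translations):
    """Choose the back-translated candidate closest to the original."""
    return max(back_translations, key=lambda s: bow_cosine(original, s))

original = "what color is the umbrella"
candidates = [                       # e.g. from English->French->English and
    "what colour is the umbrella",   # English->German->English round trips
    "which color does the umbrella have",
]
print(pick_positive_question(original, candidates))
```

A real pipeline would use pre-trained translation models and a learned sentence encoder; only the selection logic is shown here.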
In this way, we obtain the positive question samples (v_i, q_i^+). Based on the above, we obtain two types of positive samples, i.e., (v_i^+, q_i) and (v_i, q_i^+), for each sample (v_i, q_i), in which the positive image samples retain the foreground information of the image and the positive question samples are semantically equivalent to the original question.

Negative sample generation. Inspired by (Wen et al., 2021), we construct negative samples over the language and vision modalities by randomly sampling one question and one image from a mini-batch for each sample. Specifically, given a mini-batch {(v_b, q_b)}_{b=1}^B, for each sample (v_i, q_i), we randomly sample one image v_i^- and one question q_i^- from {(v_b, q_b)}_{b=1}^B to form the negative samples, namely the negative question sample (v_i, q_i^-) and the negative image sample (v_i^-, q_i).
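The mini-batch mismatching above can be sketched as below; images and questions are toy strings, and excluding index i guarantees each sample is paired with a different item from its own batch.

```python
import random

def build_negative_samples(batch, seed=0):
    """For each (v_i, q_i), sample a different image and a different
    question from the same mini-batch to form the mismatched pairs
    (v_i^-, q_i) and (v_i, q_i^-)."""
    rng = random.Random(seed)
    images = [v for v, _ in batch]
    questions = [q for _, q in batch]
    negatives = []
    for i, (v, q) in enumerate(batch):
        others = [k for k in range(len(batch)) if k != i]
        j = rng.choice(others)           # image donor
        m = rng.choice(others)           # question donor
        negatives.append(((images[j], q), (v, questions[m])))
    return negatives

batch = [("img0", "q0"), ("img1", "q1"), ("img2", "q2")]
for neg_img_pair, neg_que_pair in build_negative_samples(batch):
    print(neg_img_pair, neg_que_pair)
```

In training this would operate on feature tensors rather than strings, but the index bookkeeping is the same.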

Generated Samples Driven Robust VQA
Positive samples driven robust VQA. We attempt to achieve robust VQA with the help of the generated positive samples. Specifically, given an original sample (v_i, q_i) and its positive counterparts (v_i^+, q_i) and (v_i, q_i^+), we first feed them into the VQA model to obtain the predictions P(A|v_i, q_i), P(A|v_i^+, q_i), and P(A|v_i, q_i^+). Generally speaking, ensemble predictions usually perform better than the individual predictions. We thus adopt a simple ensemble strategy, i.e., averaging:

P_ens = ( P(A|v_i, q_i) + P(A|v_i^+, q_i) + P(A|v_i, q_i^+) ) / 3.

One intuitive way to let the VQA model benefit from the positive samples is to adopt a knowledge distillation mechanism (Hinton et al., 2015). Concretely, we regard the ensemble prediction P_ens as the teacher and the original prediction P(A|v_i, q_i) as the student, and introduce a Kullback-Leibler (KL) divergence L_dis as the objective to optimise the VQA model:

L_dis = KL( P_ens || P(A|v_i, q_i) ).   (3)

[Footnote 2: The pre-trained model is from the GitHub repository.]
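The averaging-then-distilling step can be sketched as below. The logits are toy values, and treating the averaged distribution as the KL teacher follows the teacher/student roles described above; the exact reduction and normalisation used in training are assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl_div(p, q, eps=1e-9):
    """KL(p || q): how far the student prediction q is from teacher p."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Toy logits for the original sample and its two positive counterparts.
logits_orig = np.array([2.0, 0.5, -1.0])
logits_pos_img = np.array([2.2, 0.3, -0.8])
logits_pos_que = np.array([1.8, 0.6, -1.2])

p_orig = softmax(logits_orig)
p_ens = (p_orig + softmax(logits_pos_img) + softmax(logits_pos_que)) / 3

# Eq. (3): the ensemble acts as the teacher, the original prediction
# as the student.
loss_dis = kl_div(p_ens, p_orig)
print(loss_dis >= 0)
```

Because the ensemble averages three valid distributions, it is itself a valid distribution, and the KL term is non-negative, reaching zero only when the student matches the teacher.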
By minimising the KL divergence, the VQA model extracts discrimination information from the positive samples to better answer the original question q_i given the image v_i. To guarantee that the teacher (i.e., the ensemble prediction P_ens) performs better than the student (i.e., the original prediction P(A|v_i, q_i)), we also apply the binary cross-entropy loss L_ens to P_ens to further optimise the VQA model.

Negative samples driven robust VQA. Besides adopting the positive samples to assist the training process, we also introduce a debiasing strategy based on the negative samples. As shown by (Agrawal et al., 2018), the bias issue usually means that VQA models capture superficial correlations between one modality and the answers to make predictions. To mitigate this issue, one direct solution is to increase the model's attention to both the language and vision modalities when answering questions. We adopt the negative samples to achieve this aim.
Intuitively, given a mismatched image-question pair, neither a VQA model nor a human can make a correct prediction. Drawing from this insight, given original samples and their negative counterparts, we can alleviate the biases by assigning contrary training objectives to the negative samples, which encourages the VQA model to pay more attention to the information of each modality. Concretely, inspired by (Wen et al., 2021), given an original sample (v_i, q_i, a_i) and its negative counterparts (v_i^-, q_i) and (v_i, q_i^-), the VQA model should not answer correctly when fed the negative samples, which can be achieved by minimising the probability of predicting the ground-truth answer:

L_neg = δ( P(A|v_i^-, q_i) )[x] + δ( P(A|v_i, q_i^-) )[x],   (4)

where x is the index of the ground-truth answer a_i in the answer set A, and δ is the softmax activation function. Minimising the training objective L_neg encourages the VQA model not to give the ground-truth answer when fed mismatched image-question pairs. Thus, the model has to consider both image and question information before making a prediction, which implicitly alleviates the bias issue.
Moreover, to further enhance the attention of VQA models towards both the vision and language modalities, we introduce the positive samples into this training process. Specifically, given a positive image sample (v_i^+, q_i, a_i) and a negative image sample (v_i^-, q_i), we hope the VQA model answers correctly with high confidence on the positive sample, while assigning low confidence to the ground-truth answer on the negative sample. This can be formulated as:

max  δ( P(A|v_i^+, q_i) )[x] · ( 1 - δ( P(A|v_i^-, q_i) )[x] ).   (5)

Maximising this objective encourages the VQA model to make accurate predictions for matched image-question pairs, while discouraging it from generating the ground-truth answer for negative image samples. This impels the model to allocate more attention to the vision modality.
By leveraging the monotonicity of the logarithm, we convert the maximisation problem into an equivalent minimisation problem:

L_img = -log δ( P(A|v_i^+, q_i) )[x] - log( 1 - δ( P(A|v_i^-, q_i) )[x] ).   (6)

Moreover, for the negative samples, the prediction score assigned to the ground-truth answer of the corresponding positive sample can serve as an indicator of the extent to which the VQA model captures biases: the higher the prediction score δ(P(A|v_i^-, q_i))[x], the greater the degree of bias, and thus the higher the penalty it should receive in the loss L_img. Inspired by the focal loss (Lin et al., 2017), we regard the prediction score δ(P(A|v_i^-, q_i))[x] as a measure of the degree of bias, and reformulate the loss as:

L_img^weight = -δ( P(A|v_i^-, q_i) )[x] · [ log δ( P(A|v_i^+, q_i) )[x] + log( 1 - δ( P(A|v_i^-, q_i) )[x] ) ].

We obtain L_que^weight in the same way. Thus the weighted loss is formulated as:

L_weight = L_img^weight + L_que^weight.   (7)
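The image branch of Eq. (6) and its focal-style weighting can be sketched numerically as below. The logits are toy values, and the weighting (multiplying the loss by the negative sample's ground-truth score) is one reading of the focal-loss-inspired reformulation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def image_branch_loss(logits_pos, logits_neg, gt_index, eps=1e-9):
    """Sketch of L_img and its focal-weighted variant: push the
    ground-truth score up on the positive image sample and down on
    the mismatched (negative) one; the negative score itself acts
    as the bias weight."""
    s_pos = softmax(logits_pos)[gt_index]      # δ(P(A|v+, q))[x]
    s_neg = softmax(logits_neg)[gt_index]      # δ(P(A|v-, q))[x]
    l_img = -np.log(s_pos + eps) - np.log(1 - s_neg + eps)
    l_img_weight = s_neg * l_img               # larger bias => larger penalty
    return float(l_img), float(l_img_weight)

l, lw = image_branch_loss(
    logits_pos=np.array([3.0, 0.0, -1.0]),     # confident on ground truth
    logits_neg=np.array([0.5, 0.2, 0.1]),      # mildly biased towards it
    gt_index=0,
)
print(l > lw > 0)
```

Since the weight is a probability in (0, 1), the weighted loss is always a down-scaled copy of the unweighted one, scaled least when the model is most biased.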

Overall Training Objective
In total, our overall training objective can be formulated as:

L = L_vqa + L_dis + L_ens + L_neg + λ L_weight,   (8)

where λ is a hyper-parameter weighting the loss L_weight.

Datasets
We evaluate our DDG on the OOD dataset VQA-CP v2 (Agrawal et al., 2018) and on the IID VQA v2 (Goyal et al., 2017) validation set, using the standard evaluation metric (Antol et al., 2015). Due to the page limit, we put the implementation details and compared methods in the Appendix.

Quantitative results
We report the experimental results on the VQA-CP v2 and VQA v2 datasets in Table 1. From these results, we have the following observations: 1) On the whole, the methods that balance the dataset outperform the other two types of methods, i.e., those that enhance visual attention and those that directly weaken the biases. This demonstrates that alleviating biases by paying more attention to the natural distribution of the data yields higher performance.
2) Our DDG outperforms most compared methods. Specifically, our DDG surpasses SCR (Wu and Mooney, 2019), GGE-DQ (Han et al., 2021), SSL-VQA (Zhu et al., 2020), and KD-DAug (Chen et al., 2022) by approximately 12%, 3%, 3%, and 1%, respectively. These results demonstrate the effectiveness of our DDG. 3) Although our method performs slightly worse than Mutant (Gokhale et al., 2020) and D-VQA (Wen et al., 2021), our DDG achieves higher performance on the VQA v2 dataset. Moreover, Mutant constructs its counterfactual samples relying heavily on additional annotations, while our method builds samples without introducing additional annotations. Meanwhile, compared to D-VQA, our method performs better when the data is limited, as shown in Table 2. These results further demonstrate the effectiveness of our DDG.
Benefiting from the training process based on the positive samples, our DDG performs better than all compared methods on the VQA v2 dataset. Specifically, our DDG outperforms SSL-VQA (Zhu et al., 2020) and D-VQA (Wen et al., 2021) by around 1.8% and 0.6%, respectively, which demonstrates that our DDG improves model performance on both the IID (i.e., VQA v2) and OOD (i.e., VQA-CP v2) datasets, further implying its superiority.

Qualitative results.
To further demonstrate the effectiveness of our DDG in alleviating the biases, we provide qualitative results on the VQA-CP v2 dataset in Figures 2 and 3.

Table 1: Comparison with the state-of-the-art methods on the VQA-CP v2 (Agrawal et al., 2018) and VQA v2 datasets. I-IV denote methods that enhance visual attention, directly weaken the biases, balance the dataset using additional annotations, and balance the dataset without introducing additional annotations, respectively.

From the results in Figure 2, UpDn (Anderson et al., 2018) and SSL-VQA (Zhu et al., 2020) fail to find the target objects mentioned in the question within the image, leading to erroneous predictions. In contrast, our DDG accurately localises the target objects with high confidence, resulting in precise answers to the posed questions. From Table 2, we find that our method performs better when the training data is limited, which is more practical for real-world scenarios. Specifically, our method outperforms SSL-VQA (Zhu et al., 2020) and D-VQA (Wen et al., 2021) at every proportion of the training data; notably, when only 20% of the training data remains, our DDG outperforms SSL-VQA and D-VQA by around 3%. These results demonstrate the superiority of our DDG in data-limited scenarios.
Effect of each component of our DDG. We conduct ablation studies on the VQA-CP v2 dataset to evaluate each component of our DDG, and show the experimental results in Table 3. From these results, we have the following observations: 1) When introducing the ensemble binary cross-entropy loss L_ens with the positive samples, the model performance improves by around 5% over the UpDn (Anderson et al., 2018) model (i.e., 41.53% vs. 46.63%), which demonstrates that the positive samples can assist the training process in alleviating the bias issue. 2) By incorporating the KL loss L_dis, the performance is further improved (i.e., 46.63% vs. 47.77%), which highlights that the ensemble prediction can guide the training of the original prediction. 3) Upon introducing L_neg and L_weight, which leverage the negative samples, the performance improves substantially (i.e., 47.77% vs. 61.14%). This significant enhancement underscores the importance of promoting the attention of VQA models towards both the vision and language modalities when answering the questions.
Table 3: Effect of each component of our DDG on the model performance. We show the results on the VQA-CP v2 dataset in terms of accuracy (%). We use UpDn (Anderson et al., 2018) as the backbone model.

These results further demonstrate the effectiveness of each component in our DDG.

Effect of k. Here k denotes the number of target objects selected for the positive image samples (see Section 3.2). To demonstrate the effect of k on model performance, we conduct experiments on the VQA-CP v2 dataset with different values of k. From the results in Table 4, we have the following observations: 1) As k increases (e.g., from 3 to 10), the model performance gradually improves. This indicates that a higher value of k increases the likelihood that the positive image samples indeed encompass the target objects mentioned in the corresponding questions. 2) Once k exceeds 10, the performance starts to drop, illustrating that an excessively high k introduces extraneous background information that adversely affects performance. These results demonstrate that an appropriate k helps our DDG obtain the best performance.

Evaluation of λ. λ is the weight of the loss L_weight. We conduct ablation studies with different λ, and show the experimental results in Table 5. The best performance is obtained when λ = 0.05, and the performance drops whenever λ is higher or lower than 0.05. These results demonstrate that a suitable weight for L_weight helps obtain better performance.

Table 6: Effect of different backbones (i.e., SAN (Yang et al., 2016) and UpDn (Anderson et al., 2018)) on the model performance. We report the experimental results on the VQA-CP v2 dataset in terms of accuracy (%). † denotes the re-implementation of the baseline.
Evaluation of different backbones. Our DDG is model-agnostic. To demonstrate its effectiveness on different backbones (i.e., SAN (Yang et al., 2016) and UpDn (Anderson et al., 2018)), we conduct experiments on the VQA-CP v2 dataset, and show the results in Table 6. Our DDG consistently achieves a substantial improvement in model performance regardless of the backbone. These results further embody the superiority of our DDG.

Conclusion
In this paper, we have proposed a novel method named DDG to alleviate the bias issues in VQA from the vision and language modalities. Specifically, we construct both positive and negative samples in the vision and language modalities without using additional annotations, in which the positive questions have similar semantics to the original questions, while the positive images retain the foreground information. Based on the positive samples, we introduce a knowledge distillation mechanism to facilitate the training of the original samples through guidance from the positive samples. Moreover, we put forth a strategy, aided by the negative samples, that encourages VQA models to focus more on the vision and language modalities when answering the questions. Extensive experiments on the VQA-CP v2 and VQA v2 datasets show the effectiveness of our DDG.

Datasets
We conduct experiments on the VQA-CP v2 (Agrawal et al., 2018) and VQA v2 (Goyal et al., 2017) datasets. Specifically, the training set of VQA-CP v2 contains approximately 121k images and 483k questions, while the test set contains around 98k images and 220k questions.

Implementation Details
Following existing VQA methods (Anderson et al., 2018; Cadène et al., 2019b; Zhu et al., 2020), we extract the top-36 object features, each with a dimension of 2048, from each image using the Faster R-CNN (Ren et al., 2015) model pre-trained by (Anderson et al., 2018). Moreover, each question is first truncated or padded to the same length (i.e., 14), and then encoded with GloVe (Pennington et al., 2014) embeddings of dimension 300. The dimension of the question encoder (i.e., a single-layer GRU (Cho et al., 2014)) is 1280.
Inspired by SSL-VQA (Zhu et al., 2020), we introduce one Batch Normalisation (Ioffe and Szegedy, 2015) layer before the classifier of UpDn (Anderson et al., 2018). We train our method for 30 epochs with the Adam (Kingma and Ba, 2015) optimiser. Specifically, we adopt L_ens and L_dis to train the baseline model for 12 epochs, and introduce L_neg and L_weight at the 13th epoch. The learning rate is set to 1e-3, and after 10 epochs it decreases by half every 5 epochs. The batch size is set to 256. We set k and λ to 10 and 0.05, respectively. We implement our method in PyTorch (Paszke et al., 2019), and the model is trained on one Titan Xp GPU. Moreover, our method does not introduce additional parameters beyond the backbone model.
Note that our method is model-agnostic and can be applied to different VQA backbones. To better demonstrate the effectiveness of our DDG, we conduct experiments with different backbones, including UpDn (Anderson et al., 2018) and SAN (Yang et al., 2016), under the same settings. Moreover, we run each experiment over three rounds with varying seeds, and report the mean results. The source code and the pre-trained models are available at DDG.

Training Method
We provide the training method of our DDG in Algorithm 1. Specifically, when the training epoch is lower than the threshold τ, we forward the base model M_b with the original and positive samples to calculate the knowledge distillation loss L_dis and the binary cross-entropy loss L_ens. When the training epoch is higher than the threshold τ, we additionally construct the negative image and question samples without introducing additional annotations, forward M_b with the negative samples, and obtain the losses L_neg and L_weight. Finally, we update M_b based on the overall loss L.
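The epoch-gated schedule above can be sketched as a small helper. The loss names follow the paper; treating the threshold comparison as strict (negative-sample losses activate only after epoch τ) is an assumption consistent with introducing them at the 13th epoch when τ = 12.

```python
def losses_for_epoch(epoch, tau=12):
    """Sketch of the DDG training schedule: positive-sample losses
    throughout, negative-sample losses only after threshold epoch tau."""
    active = ["L_vqa", "L_ens", "L_dis"]
    if epoch > tau:
        active += ["L_neg", "L_weight"]
    return active

print(losses_for_epoch(5))    # positive-sample phase
print(losses_for_epoch(13))   # both phases active
```

In a real training loop this helper would decide which loss terms contribute to the overall objective before the backward pass.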

More Ablation Studies
Effect of the training strategy. As shown in Algorithm 1, we adopt the knowledge distillation loss L_dis and the ensemble binary cross-entropy loss L_ens to train the base model for the first 12 epochs. To demonstrate the effectiveness of this training strategy, we conduct experiments with different training strategies on the VQA-CP v2 dataset; the results are shown in Table 7. From the results, the strategy that applies the KL loss throughout the whole training process performs worse than applying it for only the first 12 epochs. We infer that the KL loss may accelerate the fitting of the training set, which hinders the training of the negative sample losses L_neg and L_weight.

Algorithm 1 Training method of our DDG.
Require: base model M_b, batch size b, threshold τ.
1: Randomly initialise the parameters of M_b.
2: while not converged do
3:   Randomly sample a mini-batch {(v_i, q_i, a_i)}_{i=1}^b from the training data, and obtain the corresponding positive samples {(v_i^+, q_i, a_i)}_{i=1}^b and {(v_i, q_i^+, a_i)}_{i=1}^b.
4:   Forward M_b with the training data and the positive samples, and obtain the predictions P(A|v_i, q_i), P(A|v_i^+, q_i), P(A|v_i, q_i^+), and the ensemble prediction P_ens.
5:   Calculate the binary cross-entropy loss L_vqa for P(A|v_i, q_i) by Eq. (2).
6:   Calculate the knowledge distillation loss L_dis based on P(A|v_i, q_i) and P_ens via Eq. (3).
7:   Calculate the binary cross-entropy loss L_ens for P_ens by Eq. (2).
8:   if the current epoch exceeds the threshold τ then
9:     Randomly sample images and questions from the mini-batch {(v_i, q_i, a_i)}_{i=1}^b to form the negative samples {(v_i^-, q_i, a_i)}_{i=1}^b and {(v_i, q_i^-, a_i)}_{i=1}^b.
10:    Forward M_b with the negative samples, and obtain the predictions P(A|v_i^-, q_i) and P(A|v_i, q_i^-).
11:    Calculate the loss L_neg based on the predictions of the negative samples via Eq. (4).
12:    Calculate the loss L_weight based on the predictions of both positive and negative samples by Eqs. (6) and (7).
13:  end if
14:  Update M_b by minimising the overall loss L (obtained via Eq. (8)).
15: end while
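The in-batch negative sampling used in the algorithm can be sketched as below. The paper only specifies random sampling within the mini-batch, so the cyclic-shift trick (which guarantees every sample is paired with a different one) is our own simplification:

```python
import random

def sample_negative_indices(batch_size, rng=None):
    """For each position i in the mini-batch, choose another position
    j != i whose image (or question) serves as the negative pairing.
    A single random cyclic shift guarantees j != i everywhere."""
    rng = rng or random.Random()
    shift = rng.randrange(1, batch_size)
    return [(i + shift) % batch_size for i in range(batch_size)]
```

With batch size 256 as in our setting, each sample is thus paired with a negative image and a negative question drawn from elsewhere in the same batch.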
The objective of the KL loss is to bring two distributions close. In our method, we seek to make the predictions of the original samples approach the ensemble predictions. Thanks to the generated high-quality positive samples, the KL loss can improve the robustness of the VQA models to some extent (see rows 1-2 of Table 7). However, if we adopt the KL loss in the whole training process, the VQA models fit the data distribution excessively, which may hinder the training of the negative sample losses. The experimental results in Table 7 confirm this: our DDG with the KL loss applied throughout training still performs better than training with only L_dis and L_ens, but worse than applying the KL loss for only the first 12 epochs.
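As a minimal sketch of the KL divergence underlying the discussion above, over discrete answer distributions; the argument order and any temperature or weighting are defined by Eq. (3) in the paper, so the role assignment below is an assumption:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions. In our setting, p would be
    the ensemble prediction and q the original-sample prediction that is
    pulled toward it (this role assignment is an assumption); eps guards
    against log(0) for zero-probability answers."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
```

The divergence is zero when the two distributions coincide and grows as the prediction drifts from the ensemble, which is exactly the pull the distillation loss exerts.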

Table 8: Comparison with the state-of-the-art data augmentation based methods (e.g., SimpleAug and KD-DAug) on the VQA-CP v2 dataset.

More Visualisation Results
Qualitative results. We provide more visualisation results in Figure 4 to present the effectiveness of our DDG. From the results, our DDG localises the target objects more accurately than the UpDn (Anderson et al., 2018) model and the SSL-VQA method, and thus makes more correct predictions than the compared methods.
Visualisation of the generated samples. As described in Section 3.2, we generate both positive image and question samples. To evaluate the generation methods, we provide some visualisation results of the augmented questions and the selected target objects in Table 9 and Figure 5, respectively. From the results in Table 9, our augmented questions have similar semantics to the original questions, which demonstrates that the generated questions are reasonable positive samples. Moreover, although the baseline model UpDn (Anderson et al., 2018) trained on the VQA-CP (Agrawal et al., 2018)

Figure 2: Qualitative comparison among UpDn (Anderson et al., 2018), SSL-VQA (Zhu et al., 2020), and our DDG on the VQA-CP v2 test set. For each example, we show the bounding box with the highest attention weight in the image and the top-5 predicted answers. The bold red answer is the ground-truth answer.

Figure 5: We provide some visualisation results of the augmented images produced by our generation technique described in Section 3.2. We show the bounding boxes with the top-3 attention weights of the pre-trained UpDn (Anderson et al., 2018) model in each image.

Table 2: Effect of different scales of the training data of VQA-CP v2 on the model performance. We report the results in terms of Accuracy (%).

Table 4: Effect of different k. We report the experimental results in terms of Accuracy (%).

Table 5: Effect of different λ. We report the experimental results in terms of Accuracy (%).

Table 7: Effect of the training strategy. We report the experimental results in terms of Accuracy (%). "L_dis + L_ens (all epochs)" denotes that we additionally introduce the L_dis and L_ens losses to train the UpDn model in the whole training process. "DDG (KL for all epochs)" means that we train the UpDn model with the DDG method, where the KL loss is applied in the whole training process. "DDG (KL for 12 epochs)" denotes that we train the UpDn model with the DDG method, where the KL loss is applied only in the first 12 epochs.

As shown in Table 8, the VQA-CP v2 dataset comprises 438k training samples, while KD-DAug, SimpleAug, and our DDG each augment it with additional generated samples.