A Multi-modal Debiasing Model with Dynamical Constraint for Robust Visual Question Answering



Introduction
Visual Question Answering (VQA) (Antol et al., 2015) is a challenging task spanning both computer vision and natural language processing. The goal of VQA is to infer the answer from a given image and a textual question, and the task is generally cast as a classification problem. Models typically achieve promising results on test sets whose distribution is analogous to that of the training set, such as VQA v2.0 (Goyal et al., 2017). However, recent studies (Agrawal et al., 2016; Goyal et al., 2017) have pointed out that many well-developed VQA models merely over-exploit language priors from the training set to provide correct answers without reasoning. That is, the answer prediction may rely more on its correlation with the question and less on the image. For instance, in the VQA-CP v2.0 (Agrawal et al., 2018) training set, the answers to questions beginning with "how many..." are usually "2", and the answers to the specific question "what's the color of the bananas?" are almost all "yellow". Consequently, significant performance drops (Agrawal et al., 2018) are observed on out-of-distribution test data.

Figure 1: A straightforward way to mitigate the bias problem for the VQA task is to subtract the QA score from the VQA base score.

(* Jinan Xu is the corresponding author.)
Recently, solutions to this problem can be categorized into two classes: non-augmentation-based methods (Cadene et al., 2019; Ramakrishnan et al., 2018; Wu and Mooney, 2019; Selvaraju et al., 2019; Clark et al., 2019; Jing et al., 2020; Niu et al., 2021) and augmentation-based methods (Chen et al., 2020; Gokhale et al., 2020; Liang et al., 2020; Teney et al., 2020). The former seeks to weaken language bias or leverage visual grounding to increase image dependency, while the latter aims to balance the distribution of the training data. Moreover, most advanced debiasing methods still suffer from two issues, namely comprehensive bias detection (Wen et al., 2021) and In-Distribution (ID) generalizability (Niu and Zhang, 2021).
In this work, we first explore a very straightforward solution to VQA bias, which is shown in Figure 1. Generally, we design different strategies for learning and inference via a VQA base model and a question answering (QA) model. In the training procedure, these two models are optimized separately; let $S_{VQA}$ and $S_{QA}$ denote the VQA base score and the QA score, respectively. In the inference procedure, we calculate the debiased result by subtracting $S_{QA}$ from $S_{VQA}$ (see the sketch after the list below). We apply this simple strategy to the popular VQA model Updn (Anderson et al., 2018) on the VQA-CP v2.0 dataset, and find that the overall accuracy improves from 39.80% to 51.49%, as shown in Table 1. Despite its remarkable performance, we identify two major limitations of this strategy:

Table 1: Debiasing effect on the VQA-CP v2.0 dataset when applying our subtracting debiasing strategy (i.e., $S_{VQA} - S_{QA}$) to Updn (i.e., $S_{VQA}$). The best score is in bold.

• First, the model answers questions without comprehensively exploiting vision bias. The examples in the left part of Figure 2 indicate the impact of bias related to the visual side, where the salient "elephant" object leads to wrong answers. According to our statistics, most of the irrelevant wrong answers "elephant" appear in "what" type questions, while the biased answers may differ for questions of the "how many" type, such as Figure 2 (c). Therefore, the language and visual modalities may jointly bring about bias.
• Second, strong uncertainty exists in the final score, since the base model and the bias model are optimized separately. That is, the model cannot guarantee that the correct answer has the highest score after subtracting the bias effect. For this reason, the model still suffers from inadequate debiasing and over-correction, as shown in the right part of Figure 2.
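For concreteness, the following is a minimal sketch of the inference-time subtraction from Figure 1. The toy linear models and tensor names here are our own illustrative stand-ins; real VQA and QA models would score the full answer vocabulary from rich encoders.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the two separately trained models (hypothetical sizes).
num_answers, dim = 10, 16
vqa_model = nn.Linear(2 * dim, num_answers)  # consumes fused (image, question) features
qa_model = nn.Linear(dim, num_answers)       # consumes question features only

img, qst = torch.randn(4, dim), torch.randn(4, dim)
s_vqa = vqa_model(torch.cat([img, qst], dim=-1))  # S_VQA: [batch, num_answers]
s_qa = qa_model(qst)                              # S_QA:  [batch, num_answers]

# Debiased inference: subtract the language-prior score, then pick the argmax.
pred = (s_vqa - s_qa).argmax(dim=-1)
```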
To solve the above problems, we propose a Multi-modal Debiasing model with Dynamical Constraint (MDDC). For the first limitation, we construct two bias learning branches. Inspired by the use of a single modality to identify modality-specific bias, we adopt a question-only branch for language bias. Unfortunately, such a strategy is unsuitable for bias related to visual information: the same image is usually used to answer various types of questions, so the model cannot obtain question-specific vision bias that involves the information necessary to answer the question, but only a generic image-to-answer distribution bias. We assume that a more effective way is to provide some question clues for images so as to generate question-specific vision bias. Following this assumption, we design a special bias learning branch by incorporating prompts extracted from questions into an image-only answering model.
For the second limitation, we propose a dynamical constraint loss to reduce the difference in the amount of information (Ramakrishnan et al., 2018) between the VQA base module and the bias module. In this way, we dynamically subtract the bias score according to the degree of bias, thereby mitigating the uncertain inference caused by the separate optimization of the two modules.
We evaluate the proposed MDDC on VQA v2.0 and VQA-CP v2.0 benchmarks. Experimental results on both datasets demonstrate that our debiasing strategy is competitive compared with mainstream baselines.

Related Work
Visual question answering has witnessed great progress, while a growing body of work (Agrawal et al., 2016; Goyal et al., 2017) has pointed out its limitations in reasoning ability and its susceptibility to bias effects. In this section, we review recently proposed VQA debiasing approaches, which generally fall into non-augmentation-based methods and augmentation-based methods.

Non-augmentation-based Methods
One strategy is to introduce prior knowledge (i.e., human visual and textual explanations) to strengthen the visual grounding of the VQA model. HINT (Selvaraju et al., 2019) and SCR (Wu and Mooney, 2019) employ a self-critical training objective that encourages correct answers to match important image regions with the help of human explanations. Another common solution (Ramakrishnan et al., 2018; Cadene et al., 2019) is to design ensemble-based models, which add an auxiliary QA model to identify bias. Ramakrishnan et al. (2018) propose an adversarial regularization method between the VQA base model and a question-only branch to overcome language bias.
RUBi (Cadene et al., 2019) also leverages a QA model to capture language bias when unwanted regularities are identified. Wen et al. (2021) use both question-to-answer and vision-to-answer models to generate bias representations for the two modalities. Niu et al. (2021) design a counterfactual inference framework that reduces language bias by subtracting the direct language effect from the total causal effect of VQA. Guo et al. (2022) propose a loss re-scaling approach that assigns a different weight to each answer according to training data statistics.

Augmentation-based Methods
Recent studies automatically generate additional question-image pairs to balance the distribution of the training data. Chen et al. (2020) propose CSS, which produces massive counterfactual samples by masking critical objects and words. Mutant (Gokhale et al., 2020) generates samples by applying semantic transformations to images and questions. A knowledge distillation-based answer assignment (2022) has also been designed to generate pseudo answers for each image-question pair. However, it is important to note that VQA-CP is proposed to evaluate whether a VQA model can distinguish visual knowledge from language priors. Therefore, we expect the model to be robust enough to make debiased inferences under biased training.

Our Approach
In this section, we first describe the general architecture of our proposed MDDC model and then detail each component. Figure 3 depicts an overview of our approach, which consists of three major modules: (1) the standard VQA base module, which predicts the probability of each answer candidate; (2) the bias module, which captures biases from both questions and images simultaneously; and (3) the dynamical constraint module, which dynamically controls the final prediction distribution.

Standard VQA Base Module
Given a dataset $D = \{(v_i, q_i, a_i)\}_{i=1}^{N}$ containing $N$ samples, we define the $i$-th image $v_i \in V$, the $i$-th question $q_i \in Q$, and the $i$-th answer $a_i \in A$. A standard VQA module is defined as:

$$p(a \mid v_i, q_i) = \sigma\big(f_{VQA}(e_v(v_i), e_q(q_i))\big) \quad (1)$$

where $e_v(\cdot)$ and $e_q(\cdot)$ denote the image and question encoders, respectively, $f_{VQA}(\cdot)$ is the mapping function learned to project the multimodal feature to the answer space, and $\sigma(\cdot)$ is the sigmoid function.
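A minimal sketch of this module follows. The linear encoders and the multiplicative fusion are placeholders of our own; the actual encoders and fusion in backbones such as Updn are far richer.

```python
import torch
import torch.nn as nn

class VQABase(nn.Module):
    """Schematic p(a|v, q) = sigmoid(f_VQA(e_v(v), e_q(q)))."""
    def __init__(self, img_dim, qst_dim, hidden, num_answers):
        super().__init__()
        self.e_v = nn.Linear(img_dim, hidden)        # stand-in image encoder e_v(.)
        self.e_q = nn.Linear(qst_dim, hidden)        # stand-in question encoder e_q(.)
        self.f_vqa = nn.Linear(hidden, num_answers)  # mapping to the answer space

    def forward(self, v, q):
        # Simple multiplicative fusion; Updn instead attends over image regions.
        fused = self.e_v(v) * self.e_q(q)
        return torch.sigmoid(self.f_vqa(fused))  # per-answer probabilities

model = VQABase(img_dim=2048, qst_dim=300, hidden=512, num_answers=3129)
p = model(torch.randn(2, 2048), torch.randn(2, 300))  # shape [2, 3129]
```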

Bias Module
At the heart of our system is the design for obtaining bias distributions. To this end, we capture language bias using a QA model, and vision bias by incorporating question clues into a vision-to-answer model.

Language Bias Learning
Language bias refers to the prior that produces an answer based only on the given question. For a question $q_i$, we denote the language bias answer probability as:

$$p_Q(a \mid q_i) = \sigma\big(f_Q(e_q(q_i))\big) \quad (2)$$

where $f_Q(\cdot)$ is a linear function that maps the question representation to the answer space.
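A sketch of this question-only branch under the formula above (names and dimensions are illustrative; the branch consumes the representation from the shared question encoder):

```python
import torch
import torch.nn as nn

class LanguageBiasBranch(nn.Module):
    """p_Q(a|q) = sigmoid(f_Q(e_q(q))): answers from the question alone."""
    def __init__(self, qst_hidden, num_answers):
        super().__init__()
        self.f_q = nn.Linear(qst_hidden, num_answers)  # linear head f_Q(.)

    def forward(self, q_feat):
        # q_feat: question representation produced by the shared encoder e_q(.)
        return torch.sigmoid(self.f_q(q_feat))

branch = LanguageBiasBranch(qst_hidden=512, num_answers=3129)
p_q = branch(torch.randn(2, 512))  # language-bias probabilities, [2, 3129]
```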

Question-guided Vision Bias Learning
We introduce a question-guided vision bias learning module for VQA debiasing, shown in Figure 4. Since visual information alone can hardly yield targeted bias, a more flexible way is to guide images to generate answers using the intent and concepts of questions. The intent-level clues provide semantic enhancement for individual images, manifesting the goal of the question from a global view. Additionally, the concept-level clues supplement images with further semantics, where a concept refers to an entity mentioned in the question. We compute the answer probability predicted by the question-guided vision bias module as follows:

$$p_V(a \mid v_i, q_i) = \sigma\big(f_F(e_v(v_i), a_t, q_t, c)\big) \quad (3)$$

where $a_t$, $q_t$, and $c$ stand for the answer type, question type, and concepts of the question, respectively, and $f_F(\cdot)$ is the function that combines these components and maps the fused representation to the answer space. Concretely, we fuse $a_t$ and $q_t$ via a gate mechanism to obtain a question intent vector, which is then added to each image region embedding. A multi-layer self-attention network (Vaswani et al., 2017) is adopted for interactive learning over the image features incorporated with intent clues. Finally, we obtain the vision bias output via a concept attention or average pooling operation, where the concept attention is a standard attention mechanism using $c$ as the query to weight the image regions. However, we assume that not all questions are suited to using concepts. For example, for the other type question "What color is the apple?", it is easy to answer "red" if the concept "apple" and the intent "what color" are provided. But number type questions remain hard to answer even given the intent and concept. Thus we only apply concept attention to number type questions, and employ average pooling for yes/no and other type questions.
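The description above translates roughly into the following sketch. The exact gate formulation, dimensions, and attention details are our assumptions; Figure 4 in the paper is authoritative.

```python
import torch
import torch.nn as nn

class QuestionGuidedVisionBias(nn.Module):
    def __init__(self, dim, num_answers, num_layers=3, num_heads=8):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)  # fuses answer-type a_t and question-type q_t
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.self_attn = nn.TransformerEncoder(layer, num_layers)
        self.f_f = nn.Linear(dim, num_answers)  # f_F(.): map to the answer space

    def forward(self, regions, a_t, q_t, concept, is_number_type):
        # Intent vector from the answer-type / question-type embeddings via a gate.
        g = torch.sigmoid(self.gate(torch.cat([a_t, q_t], dim=-1)))
        intent = g * a_t + (1 - g) * q_t
        # Add the intent clue to every image region, then interact via self-attention.
        h = self.self_attn(regions + intent.unsqueeze(1))  # [B, R, dim]
        if is_number_type:
            # Concept attention: the concept embedding acts as the query.
            w = torch.softmax((h @ concept.unsqueeze(-1)).squeeze(-1), dim=-1)
            pooled = (w.unsqueeze(-1) * h).sum(dim=1)
        else:
            pooled = h.mean(dim=1)  # average pooling for yes/no and other types
        return torch.sigmoid(self.f_f(pooled))

m = QuestionGuidedVisionBias(dim=512, num_answers=3129)
p_v = m(torch.randn(2, 36, 512), torch.randn(2, 512), torch.randn(2, 512),
        torch.randn(2, 512), is_number_type=True)
```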

Dynamical Constraint
As mentioned above, it is necessary to build a connection between the standard VQA base module and the bias module. In this subsection, we introduce a dynamical constraint loss $L_D$ to control the final distribution after subtracting the bias probability. Let $B = \{b_1, \ldots, b_M\}$ denote the set of features extracted from the $M$ bias modules, and let $s$ denote the feature output by the VQA base module. $L_D$ is then computed as:

$$L_D = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{A} \beta \big( I(a_j \mid s) - I(a_j \mid b) \big) \quad (4)$$

where $N$ and $A$ are the number of samples and the number of candidate answers, respectively, $\beta$ is a dynamic control coefficient, and $I(X \mid Y)$ represents the amount of information of $X$ under the condition of $Y$. The goal of $L_D$ is to decrease the uncertainty of the standard VQA module's prediction and increase the uncertainty of the bias module's prediction according to the degree of bias. The former helps the VQA base module learn adequate knowledge otherwise disrupted by bias. The latter prompts the bias module to compute an appropriate bias score, because different samples are affected by bias to varying degrees. Note that both the VQA probability score $p(a_j \mid s)$ and the bias probability score $p(a_j \mid b)$ follow Bernoulli distributions, since the sigmoid function is applied to the final output layer. Therefore, $L_D$ differs from the Kullback-Leibler divergence (Doersch, 2016). More details are explained in Appendix A.
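Under our reconstruction of the formula above, with $I(a_j \mid \cdot) = -\log p(a_j \mid \cdot)$ and $\beta$ taken as the detached VQA probability (as Appendix A suggests via $\beta_i = p(a_i \mid s)$), a sketch of $L_D$ could look as follows. Treating $\beta$ as a detached coefficient is our assumption.

```python
import torch

def dynamical_constraint_loss(p_vqa, p_bias_list, eps=1e-8):
    """L_D sketch: shrink the base module's information while expanding the
    bias modules' information, weighted by the degree of bias beta.

    p_vqa: [N, A] sigmoid probabilities from the VQA base module.
    p_bias_list: list of [N, A] probabilities, one tensor per bias module.
    """
    beta = p_vqa.detach()          # dynamic coefficient (assumed: beta = p(a|s))
    i_s = -torch.log(p_vqa + eps)  # I(a|s): information of the base prediction
    loss = 0.0
    for p_b in p_bias_list:
        i_b = -torch.log(p_b + eps)  # I(a|b) for one bias branch
        loss = loss + (beta * (i_s - i_b)).mean()
    return loss
```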

Training and Inference
Training. In the model training phase, we optimize the standard VQA module and the bias module separately via the binary cross-entropy loss $bce(\cdot)$:

$$L_B = bce\big(p(a \mid s), y\big) + w \sum_{k=1}^{M} bce\big(p(a \mid b_k), y\big) \quad (5)$$

where $w$ is a hyper-parameter balancing the base and bias components, and $y$ is the target label. The final loss function is then:

$$L = L_B + \lambda L_D \quad (6)$$

where $\lambda$ is the discount coefficient. Additionally, we stop the gradient back-propagation from the bias module to the language and vision encoders in order to prevent the VQA base module from updating in a biased direction.
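A sketch of the overall training objective under these definitions, reusing the `dynamical_constraint_loss` sketch above (the stop-gradient to the encoders is assumed to be applied inside the model by detaching the features fed to the bias branches):

```python
import torch.nn.functional as F

def total_loss(p_vqa, p_bias_list, y, w=1.0, lam=1.0):
    """L = L_B + lambda * L_D, with per-answer binary cross-entropy."""
    # L_B: bce for the base module plus w-weighted bce for each bias module.
    l_b = F.binary_cross_entropy(p_vqa, y)
    for p_b in p_bias_list:
        l_b = l_b + w * F.binary_cross_entropy(p_b, y)
    # Dynamical constraint from the previous subsection.
    return l_b + lam * dynamical_constraint_loss(p_vqa, p_bias_list)
```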
Inference. At the inference stage, the final score for the $j$-th answer, $\Delta p(a_j)$, depends on the answer type and is defined as:

$$\Delta p(a_j) = p(a_j \mid s) - \sum_{k=1}^{M} \alpha_{t_k}\, p(a_j \mid b_k) \quad (7)$$

where $t$ is the answer type (e.g., yes/no, number, and other), and $\alpha_t$ stands for the weight for type $t$, which satisfies $\sum_{k=1}^{M} \alpha_{t_k} = 1$.
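As we reconstruct Equation (7), inference reduces to a weighted subtraction over the bias branches; a small sketch (the `alphas` are the type-dependent weights chosen per answer type):

```python
def debiased_score(p_vqa, p_bias_list, alphas):
    """Delta p(a_j) = p(a_j|s) - sum_k alpha_{t_k} * p(a_j|b_k).

    alphas: per-branch weights for the current answer type t (summing to 1).
    """
    score = p_vqa.clone()
    for alpha, p_b in zip(alphas, p_bias_list):
        score = score - alpha * p_b
    return score  # take the argmax over answers for the final prediction
```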

Experimental Settings
Dataset. We conduct experiments on the VQA-CP v2.0 dataset (Agrawal et al., 2018), which is proposed to evaluate debiasing ability. Besides, we also validate performance on VQA v2.0 (Goyal et al., 2017) to examine generalization on an ID dataset. For both datasets, the questions are divided into three categories: yes/no, number, and other.

Table 2: List of important hyper-parameters, where $\alpha_{t_k} = \{\alpha_{t_1}, \alpha_{t_2}\}$ stands for the weight combination of the language bias branch (i.e., $\alpha_{t_1}$) and the question-guided vision bias branch (i.e., $\alpha_{t_2}$) in Equation (7); $\alpha_y$, $\alpha_n$, $\alpha_o$ denote the weights of the three question types, namely yes/no, number, and other.
Metric. Following previous work (Antol et al., 2015), we adopt the standard VQA challenge evaluation metric:

$$Acc(ans) = \min\left(1, \frac{\#\text{humans provided } ans}{3}\right) \quad (8)$$

where $\#\text{humans provided } ans$ is the number of human annotators who provided the answer $ans$ for the question.

Hyper-Parameters and Environment. Optimal hyper-parameters are chosen via grid search. All embeddings of question clues are randomly initialized. The intent extraction model is trained by fine-tuning BERT (Devlin et al., 2019), and the concepts are extracted with an entity recognition tool. We implement our model in PyTorch 1.4.0. All computations are done on NVIDIA Tesla V100 GPUs. Other important hyper-parameters are listed in Table 2.
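Equation (8) can be computed as in the following sketch, assuming the per-question human answer lists are available:

```python
def vqa_accuracy(pred_answer, human_answers):
    """Standard VQA accuracy: min(1, #humans who gave this answer / 3)."""
    count = sum(1 for a in human_answers if a == pred_answer)
    return min(1.0, count / 3.0)

# Example: 10 annotators, 4 of whom answered "yellow" -> accuracy 1.0.
print(vqa_accuracy("yellow", ["yellow"] * 4 + ["green"] * 6))
```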

Tested Backbones
We mainly implement our approach on two VQA backbones, namely Updn (Anderson et al., 2018) and LXMERT (Tan and Bansal, 2019). Updn: a popular VQA baseline that employs a pre-trained object detection model (Ren et al., 2015) to obtain features of salient image regions. LXMERT: a multi-modal pre-training framework built on a Transformer cross-modality encoder. In our experiments, we divide this backbone into two groups depending on whether pre-trained weights are loaded.

Baselines
We compare our model with existing mainstream bias reduction techniques, which can be grouped into non-augmentation-based and augmentation-based methods.

Table 4: Ablation study on the VQA-CP v2.0 and VQA v2.0 datasets. † denotes our implementation, and * stands for using the LXMERT model structure without loading multi-modal pre-trained weights. The best score is in bold and the second best is underlined.

Results
The results on VQA-CP v2.0 and VQA v2.0 are reported in Table 3.
Results on VQA-CP v2.0. Overall, our method achieves the best performance on the VQA-CP v2.0 dataset among non-augmentation approaches. Drilling down to question types, our method also obtains competitive results: the second-best result on yes/no questions and the best result on other questions. It is worth noting that our strategy obtains improvements of 6.90% and 4.13% on number and other type questions over CF-VQA (Niu et al., 2021), which also employs subtraction to reduce the bias effect. We infer that this is because our model detects more comprehensive biases from both the language and vision sides, and our dynamical constraint loss also plays a role in adjusting the final distribution. To sum up, these results demonstrate not only the effectiveness of our approach in reducing bias but also its competitive performance on the ID dataset.

Ablation Study
We conduct an ablation study to analyze the effects of the dynamical constraint loss (denoted as $+ L_D$) and the bias learning strategy, which can be decomposed into language bias learning (denoted as $+ b_l$) and question-guided vision bias learning (denoted as $+ b_v$). For fairness, all models are trained under the same settings. Table 4 lists the results on the two datasets. Coupling all the components is clearly helpful on VQA-CP v2.0 (Agrawal et al., 2018) and narrows the performance drop on VQA v2.0 (Goyal et al., 2017). Without multi-modal pre-trained knowledge (e.g., Updn (Anderson et al., 2018) and LXMERT* (Tan and Bansal, 2019)), $+ b_v$ brings significant improvement on number type questions on the OOD dataset (i.e., VQA-CP v2.0). A possible reason is that the bias in number type questions on VQA-CP v2.0 is severely affected by images combined with language information (e.g., intent and concept). When loading pre-trained weights into LXMERT, all the components still bring slight improvements. On the whole, our bias learning strategy ($+ b_l + b_v$) detects more comprehensive biases than $+ b_l$ alone, and it narrows the performance gap on the ID dataset (i.e., VQA v2.0). Moreover, integrating the dynamical constraint improves the results across all base models on the OOD dataset, and $L_D$ has no significant negative impact on the ID dataset. To conclude, it is always preferable to use all the components ($+ b_l + b_v + L_D$) due to their superior combined performance, which confirms the effectiveness of our bias learning strategy and dynamical constraint.

Impact of Question Clues
In the following set of experiments, we demonstrate the effectiveness of the question clues used in our question-guided vision bias learning module on VQA-CP v2.0. Note that we only vary three components (i.e., image, intent, concept) on top of the full Updn + MDDC model. As shown in Table 5, both the image and the question clues are necessary for debiasing. Concretely, leveraging the intent feature is helpful for other type questions, and further incorporating concept information via the concept-attention mechanism boosts the performance on number type questions from 12.27% to 19.93%. This phenomenon indicates that the base model Updn easily overfits the training set of the VQA-CP v2.0 dataset and learns little valid knowledge for number recognition.

Impact of Layer Number
We further investigate the number of self-attention (Vaswani et al., 2017) layers in the question-guided vision bias learning module on the VQA-CP v2.0 test split. Figure 5 shows how the test accuracy changes as the layer number increases, based on the Updn model. A proper number of layers makes the model perform well on number type questions, which again verifies the effect of our vision bias learning strategy, while accuracies on the remaining categories are more stable. The best results are obtained with 3 layers, and additional layers do not provide significant improvement.

Qualitative Analysis
Debiasing qualitative examples on VQA-CP v2.0 are shown in Figure 6. Inspecting the results, we can further verify that our debiasing approach addresses more comprehensive biases and dynamically adjusts the final score. As illustrated in Figure 6, MDDC successfully mitigates biased inference on all question types. For the example in the first row, MDDC overcomes the bias related to both the question and image sides (i.e., "white", "elephant"). The two examples in the second row show the effect of MDDC on number and yes/no type questions. Another example based on the Updn model, in Figure 7, further illustrates the benefit of the dynamical constraint loss $L_D$. Specifically, $L_D$ increases the difference between the VQA score and the QA score corresponding to the answer "no", and it narrows the score gap for the wrong answer (i.e., "yes"), which promotes the correct answer to the highest final score.

Conclusion
We propose a robust visual question answering model with a dynamical constraint to reduce as much multi-modal bias as possible. Compared with previous research, we investigate a very straightforward way to obtain a debiasing effect by subtracting the bias score from the VQA base score. On one hand, we design a language bias learning branch and a question-guided vision bias learning branch to detect comprehensive biases. On the other hand, we propose a dynamical constraint loss over the two bias branches to alleviate, to some extent, the over-correction and insufficient-debiasing problems. Experimental results on the VQA-CP v2.0 and VQA v2.0 datasets demonstrate the effectiveness of our approach from both quantitative and qualitative perspectives.

Limitations
Our model introduces additional parameters in the question-guided vision bias module compared with other methods. Moreover, it is worth exploring whether the question-guided vision bias module can improve number type questions on other OOD datasets.

A A Theoretical Explanation
This section presents an approximately feasible theoretical explanation for our dynamical constraint $L_D$. By definition, the total loss $L$ ($L = L_B + \lambda L_D$) can be grouped by like terms and simplified. For ease of explanation, we extract the loss term of answer $a_i \in A$ for a single sample, namely $L(a_i)$:

$$L(a_i) = bce\big(p(a_i \mid s), y_i\big) + w \cdot bce\big(p(a_i \mid b), y_i\big) + \lambda \beta_i \big( I(a_i \mid s) - I(a_i \mid b) \big) \quad (9)$$

We assume that bias manifests as follows: the score of a common answer is extremely high or extremely low in the training phase, which affects the selection of the correct answer when evaluating on the test set.
Here, we consider two boundary cases, depending on whether the target label of the current answer is 1 or 0 (i.e., $y = 1$ or $y = 0$). If $y_i = 1$, both the VQA score $p(a_i \mid s)$ and the bias score $p(a_i \mid b)$ are optimized toward 1. In this case, $L(a_i)$ can be transformed to:

$$L(a_i) = \underbrace{-(1 + \lambda \beta_i) \log p(a_i \mid s)}_{\text{base learning}} \; \underbrace{- (w - \lambda \beta_i) \log p(a_i \mid b)}_{\text{bias learning}} \quad (10)$$

Since $\beta_i = p(a_i \mid s)$, when $\beta_i \to 1$ the learning of the VQA base model is further boosted while the bias learning is inhibited. Due to the extremely long-tailed answer distribution, the training objective can be unbalanced across different answers, which indicates that unbalanced bias knowledge may be over-learned from the more frequent samples during training. Thus, if a sample is severely biased, its bias score tends not to decrease much after suppression, and the VQA base score will be relatively less preserved after subtraction (as shown in Figure 8 A).
If $y_i = 0$, both $p(a_i \mid s)$ and $p(a_i \mid b)$ are optimized toward 0, and $L(a_i)$ reduces to:

$$L(a_i) = -\log\big(1 - p(a_i \mid s)\big) - w \log\big(1 - p(a_i \mid b)\big) \underbrace{- \lambda \beta_i \log p(a_i \mid s)}_{\text{base learning}} \; \underbrace{+ \lambda \beta_i \log p(a_i \mid b)}_{\text{bias learning}} \quad (11)$$

Intuitively, the term $-\lambda \beta_i \log p(a_i \mid s)$ inhibits $p(a_i \mid s) \to 0$ and thus helps prevent the model from overfitting the training set to some extent. In addition, the term $+\lambda \beta_i \log p(a_i \mid b)$ pushes $p(a_i \mid b) \to 0$. Therefore, when the VQA base score of a wrong answer is high (i.e., case B), the adjustment of the prediction scores is similar to the $y = 1$ condition (as shown in Figure 8 B).
In this way, during the inference procedure, the final score of a biased answer is more likely to decrease, while an unbiased answer tends to retain a relatively higher final score. In summary, this strategy helps prevent the model from overfitting the training set and dynamically yields a more appropriate final score.
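As a quick numeric check of our reading of Equation (10), the combined coefficients on the two log-terms for a $y = 1$ answer behave as claimed; a toy computation under this reconstruction:

```python
# Eq. (10) coefficients: -(1 + lam*beta) log p(a|s) - (w - lam*beta) log p(a|b).
lam, w = 1.0, 1.0
for beta in (0.1, 0.5, 0.9):
    base_coef = 1 + lam * beta  # grows with beta: base learning is boosted
    bias_coef = w - lam * beta  # shrinks with beta: bias learning is inhibited
    print(f"beta={beta}: base coef={base_coef:.1f}, bias coef={bias_coef:.1f}")
```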

B Illustrative Examples
In order to fully demonstrate the specific role of the dynamical constraint loss, we provide more illustrative examples showing the probability predictions of each branch, as shown in Figure 9. We choose Updn as the backbone, and all results are obtained under the same experimental settings.
For the cases in Figure 9, we find that when $L_D$ is not added (w/o $L_D$), strong uncertainty exists in the prediction results. The reason is that the VQA branch and the bias branches are trained separately, causing the debiasing effect to be less significant in certain cases. By contrast, when we explicitly introduce $L_D$ into the model (w/ $L_D$), we obtain satisfactory results.

Figure 9: Examples on the VQA-CP v2.0 test split verifying the effectiveness of the dynamical constraint loss, specifically for the standard VQA score, the language bias score (i.e., QA score), and the question-guided vision bias score (i.e., Q-guided VA score). Within the top five rankings, correct answer scores (in red) and incorrect answer scores (in blue) are given, where scores marked in dark red or dark blue correspond to the answers that change most after incorporating $L_D$.
