Causal Reasoning through Two Cognition Layers for Improving Generalization in Visual Question Answering

Generalization in Visual Question Answering (VQA) requires models to answer questions about images with contexts beyond the training distribution. Existing attempts primarily refine unimodal aspects, overlooking enhancements on the multimodal side. Besides, diverse interpretations of the input lead to various modes of answer generation, highlighting the role of causal reasoning between the interpreting and answering steps in VQA. Through this lens, we propose Cognitive pathways VQA (CopVQA), which improves multimodal predictions by emphasizing causal reasoning factors. CopVQA first operates a pool of pathways that capture diverse causal reasoning flows through the interpreting and answering stages. Mirroring human cognition, we decompose the responsibility of each stage into distinct experts and a cognition-enabled component (CC). The two CCs strategically execute one expert for each stage at a time. Finally, we prioritize answer predictions governed by pathways involving both CCs while disregarding answers produced by either CC alone, thereby emphasizing causal reasoning and supporting generalization. Our experiments on real-life and medical data consistently verify that CopVQA improves VQA performance and generalization across baselines and domains. Notably, CopVQA achieves a new state-of-the-art (SOTA) on the PathVQA dataset and comparable accuracy to the current SOTA on VQA-CPv2, VQAv2, and VQA-RAD, with one-fourth of the model size.

However, real-life multimodal data diversity poses a challenge for VQA models to achieve Out-of-Distribution (OOD) generalization, which involves performing well on data beyond the training distribution instead of relying on independent and identically distributed (iid) data [Zhang et al., 2021, Goyal and Bengio, 2022, Kawaguchi et al., 2022]. Recent studies have highlighted a risk to OOD generalization in VQA, where models may respond solely based on the question and ignore the input image, due to correlations between the question and answer distribution [Niu et al., 2021, Wen et al., 2021] that exist in human knowledge. For example, questions starting with "Is this...?" are typically expected to be yes/no questions (e.g., "Is this a cat?") rather than multiple-choice questions (e.g., "Is this a cat or dog?"), so a simple "yes" or "no" is often a plausibly correct answer.
There have been many attempts to solve this issue, such as (1) reducing the linguistic correlation [Niu et al., 2021, Wen et al., 2021] by suppressing answers generated from the question alone, (2) strengthening the visual processing [Yang et al., 2020], and (3) balancing the answer distribution by generating new image-question pairs [Chen et al., 2020, Gokhale et al., 2020, Si et al., 2022]. However, these approaches tend to overlook the critical aspect of enhancing multimodal predictions, focusing instead on unimodal aspects (either language or visual) or the data itself. Our assumption is that enhancing the quality of multimodal predictions is a promising route toward improving generalization in VQA.
In fact, solving the VQA task requires the integration of multimodal processing and common sense knowledge. Consequently, VQA can be conceptualized as a two-stage process, input interpreting and answering, which involves generating an interpretation of the multimodal input and then answering the question by querying the knowledge space. Besides, similarly to how humans tackle the VQA task, misunderstanding the input can harm the answering stage, posing a risk to the overall performance of VQA. Therefore, comprehending the causal reasoning behind this two-stage process becomes crucial to improving the VQA task.
In this work, we propose Cognitive pathways VQA (CopVQA) to boost the causal reasoning in VQA to enhance the OOD generalization.
CopVQA derives from findings on knowledge modularity and cognitive pathways in cognitive neuroscience [Baars, 2005, Kahneman, 2011, Goyal and Bengio, 2022], which suggest that (1) the human brain organizes independent modules to handle distinct pieces of knowledge and (2) efficient communication among these modules supports generalization. We decompose each of the interpreting and answering stages into a set of experts and a cognition-enabled component (CC) that strategically activates one expert for each stage at a time. By this approach, CopVQA disentangles the VQA task into specialized experts connected by pathways through the interpreting-answering process. Subsequently, we extend the notion of bias in VQA beyond the linguistic correlation to also cover the monolithic procedure (i.e., one applied without expert selection). Finally, we emphasize answers governed by the disentangled stages with multimodal input and disregard other answers, including those from the monolithic procedure and from unimodal input.
The contributions of this work are summarized as follows: (1) we propose CopVQA to improve OOD generalization by enhancing causal reasoning, compatible with diverse VQA baselines and domains; (2) to the best of our knowledge, CopVQA is the first work that formulates VQA as two layers of cognitive pathways, further facilitating research on causal reasoning in VQA; and (3) we achieve a new SOTA on the PathVQA dataset and results comparable to the current SOTAs on the VQA-CPv2, VQAv2, and VQA-RAD datasets with only one-fourth of the model size.

Related Work
The generalization restriction in VQA arises from biases between questions and answers in human knowledge, where certain question types exhibit strong correlations with predictable answers based on common knowledge. These biases are also reflected in VQA datasets. For example, on the VQAv1 dataset [Antol et al., 2015], a VQA model can quickly achieve an accuracy of around 40% on sport-related questions by simply providing the answer "tennis." To further challenge the generalization ability of VQA models, the VQA Changing Priors (VQA-CPv2) dataset [Agrawal et al., 2017] was introduced. This dataset is deliberately designed to feature different answer distributions between its train and test sets. The emergence of the VQA-CP dataset has therefore brought about a significant shift in the VQA landscape, demanding that VQA models go beyond exploiting correlations and overcome biases in the given training data.
Among attempts at capturing language biases, RUBi [Cadene et al., 2019] introduces a noteworthy approach that computes the answer probability using only the input question, thereby capturing linguistic biases. The question-only branch carrying the biases is then used to compute a mask that mitigates the biased unimodal contribution to the multimodal prediction. Building upon this line of research, CFVQA [Niu et al., 2021] pioneers a comprehensive causal-effect view of VQA that subtracts the impact of the question-only branch, represented as an answer logit, from the overall answer logit (further discussed in Section 3). Similarly, DVQA [Wen et al., 2021] proposes subtracting the question-only branch's impact from the multimodal branch in the hidden layer. By doing so, DVQA mitigates biased predictions before passing the output through the classifier for final answer prediction.
In this study, we delve deeper into capturing biases using separate branches and subsequently eliminating them. However, besides the linguistic correlations, we assume another potential restriction on VQA generalization is the monolithic approach applied across the diverse scenarios of multimodal input. Therefore, we attempt to eliminate answers produced by monolithic procedures in the interpreting and answering stages while, in parallel, emphasizing answers produced by the disentanglement approach.
3 Preliminaries: Causality view in VQA

Introduced in CFVQA [Niu et al., 2021], the causal graph of VQA is depicted in Figure 1a, in which the inputs V and Q cause an answer A, and a mediator K represents the knowledge space. The direct paths Q → A and V → A represent answers based solely on unimodal input. With a single mediator K, the idea of CFVQA is to capture the language prior by the Q → A branch, which involves only the question and produces an answer logit Z_q. Likewise, the answer produced through K with multimodal input is Z_k. Subsequently, they produce the final answer logit by combining Z_k and Z_q, where c is a parameter for normalization. Regarding optimization, they utilize the Cross-Entropy loss (CELoss) and define L = CELoss(logσ(Z_k + Z_q), a) + CELoss(logσ(Z_q), a).
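The loss above can be sketched in a few lines (a minimal NumPy illustration, not the authors' implementation; the helper names are ours, and the logit vectors z_k and z_q stand in for outputs of the multimodal and question-only branches):

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable log(sigma(z)), applied elementwise to a logit vector.
    return -np.logaddexp(0.0, -z)

def ce_loss(scores, target):
    # Cross-entropy of a softmax taken over the (log-sigmoid-fused) scores.
    return float(-(scores - np.logaddexp.reduce(scores))[target])

def cfvqa_loss(z_k, z_q, a):
    # L = CELoss(log sigma(Z_k + Z_q), a) + CELoss(log sigma(Z_q), a)
    return ce_loss(log_sigmoid(z_k + z_q), a) + ce_loss(log_sigmoid(z_q), a)
```

The first term trains the fused prediction while the second trains the question-only branch, whose captured language prior is later removed from the prediction at inference.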

Cognitive pathways VQA -CopVQA
In this section, we present CopVQA from multiple viewpoints, which are the overview in Section 4.1, the causal-effect view of CopVQA in Section 4.2, and the CopVQA implementation in Section 4.3.

CopVQA overview
CopVQA, depicted in Figure 2, serves as a VQA backbone that emphasizes causal reasoning in multimodal prediction. CopVQA mitigates the negative consequences of disregarding causal reasoning, thereby enhancing generalization. Initially, we view the knowledge space as (1) multimodal knowledge for interpreting the multimodal input, and (2) commonsense knowledge for answering based on the interpretation obtained. Additionally, assuming that different interpretations of the input lead to diverse ways of answering, we perceive the use of monolithic procedures for interpreting and answering across diverse multimodal inputs as a potential bias that hampers generalization, extending beyond the linguistic correlations discussed in prior work. Consequently, we define a non-biased approach as one that commits to integrating the multimodal input and strategically selecting proper knowledge pieces for interpreting and answering, rather than relying on monolithic procedures.
Each of the interpreting and answering stages involves several distinct experts with a cognition-enabled component (CC) that activates one expert at a time. Consequently, we define a complete reasoning flow as one that selects an appropriate expert pair for the interpreting-answering process. In contrast, incomplete reasoning flows rely on monolithic procedures or utilize unimodal input. Finally, we emphasize the prediction obtained through the complete reasoning flow and disregard those from incomplete reasoning flows.

CopVQA from causal-effect view
To establish a solid connection between the comprehensive overview and implementation details of CopVQA, we present the causal-effect view of the proposed architecture in this section.
The causal view of CopVQA, presented in Figure 1b, contains the input pair (V, Q) that leads to an answer A, controlled by the two sets of cognitive pathways denoted as mediators Cop_1 and Cop_2. Specifically, we have the direct paths Q → A and V → A, and several indirect paths. Any path that does not involve Cop_1 (Q → A, V → A, and Q → Cop_2 → A) is categorized as a unimodal path, as it does not involve multimodal interpretation. Likewise, indirect paths that bypass Cop_2 (e.g., (V, Q) → Cop_1 → A) are considered monolithic, as they do not involve expert selection for both interpreting and answering. Finally, CopVQA emphasizes the effect of the complete reasoning flow, (V, Q) → Cop_1 → Cop_2 → A, and eliminates the effects of incomplete reasoning flows, including the unimodal and monolithic paths.

Implementation Details
We design cognitive pathways as a Mixture of Experts [Jacobs et al., 1991] (MoE), which disentangles a task into specialized experts. Mathematically, an MoE setup M contains (1) N experts {E_1, E_2, ..., E_N} with distinct parameters and (2) a gating model G : R^d → R^N. As described in Equation 1, with input x ∈ R^d, the output y is the sum of the results from the N experts weighted by g, produced by G: y = Σ_{n=1}^{N} g_n · E_n(x). We compute g = Gumbel-max(G(x)), using the Gumbel-max operation to achieve a 1-hot-like probability.
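A minimal sketch of this MoE with hard Gumbel-max gating (NumPy; the lambda experts and gate below are stand-ins for the neural layers used in practice):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_max_onehot(logits):
    # Gumbel-max trick: argmax(logits + Gumbel noise) samples from
    # Categorical(softmax(logits)); the choice is returned as a 1-hot vector.
    gumbel = -np.log(-np.log(rng.random(logits.shape)))
    g = np.zeros_like(logits)
    g[np.argmax(logits + gumbel)] = 1.0
    return g

def moe_forward(x, experts, G):
    # y = sum_n g_n * E_n(x), with g a 1-hot selection produced by the gate G.
    g = gumbel_max_onehot(G(x))
    return sum(g_n * E(x) for g_n, E in zip(g, experts))
```

Because g is exactly 1-hot, only one expert's output survives the sum, matching the "one expert at a time" behavior of each CC.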

Computational Flow in CopVQA
We describe the computational flow in CopVQA by introducing the notation and the two-stage process.
Given the pre-processed question q ∈ R^{d_q} and image v ∈ R^{d_v}, where d_q and d_v are the dimensionalities of the modalities, the VQA model aims to predict the answer â, an index in the vocabulary set of size V. In the interpreting stage, CopVQA produces an interpretation as a d_i-dimensional vector, denoted as i ∈ R^{d_i}. In the answering stage, CopVQA applies a classifier that outputs Z_jk ∈ R^V, where j, k ∈ {0, 1}. Specifically, j is 1 when the interpretation i is governed by experts from Cop_1 in interpreting, and 0 otherwise (governed by a monolithic procedure). We analogously define k ∈ {0, 1} for Cop_2 in the answering stage. The answer â is obtained as the index of the maximum value of logσ(Z_jk).
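Since logσ is monotonic, finalizing the answer amounts to taking the argmax of the chosen logit vector over the vocabulary (an illustrative sketch; the function name is ours):

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable elementwise log(sigma(z)).
    return -np.logaddexp(0.0, -z)

def finalize_answer(Z):
    # a_hat = index of the maximum of log sigma(Z) over the vocabulary.
    return int(np.argmax(log_sigmoid(Z)))
```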
Notation 1: Denote M_I and M_A as the MoE setups aligned with Cop_1 for Interpreting and Cop_2 for Answering, respectively. The N_1 experts in M_I map R^{d_q} → R^{d_i}, corresponding to the interpreting stage; similarly, the N_2 experts in M_A map R^{d_i} → R^V, corresponding to the answering stage. The cognition-enabled components align with G(·) in each MoE: G_I : R^{d_q} → R^{N_1} and G_A : R^{d_i} → R^{N_2} for the interpreting and answering stages, respectively.
Notation 2: We denote the monolithic procedures W^1_M : R^{d_q} → R^{d_i} and W^2_M : R^{d_i} → R^V for the interpreting and answering stages, respectively.
Stage 1 - Input interpreting: Let i_mul ∈ R^{d_i} be the multimodal interpretation and i_q ∈ R^{d_i} be the unimodal interpretation, described in Equation 2: i_mul = Fusion(M_I(q), v) and i_q = W^1_M(q). Specifically, i_mul is computed by passing q through Cop_1 and subsequently applying a Fusion function (following the baselines; details in Section 5.3) with v to obtain a multimodal interpretation. Likewise, i_q is obtained from the input question q alone by W^1_M.
Stage 2 - Answering: In this step, we compute the pool of outputs from the multiple reasoning flows. Let Z_11 = M_A(i_mul), described in Equation 3, align with the full causal reasoning flow governed by both cognition layers. Likewise, Z_10 = W^2_M(i_mul), as in Equation 4, represents the monolithic path that involves only Cop_1. Finally, Z_01 = M_A(i_q) and Z_00 = W^2_M(i_q), formulated in Equations 5 and 6, represent the outputs of the unimodal paths.
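The two-stage flow producing the four logits can be sketched as follows (a NumPy toy with a placeholder elementwise fusion and with d_i = d_v for simplicity; M_I, M_A, W1_M, and W2_M are passed in as callables and are stand-ins for the actual expert and monolithic networks):

```python
import numpy as np

def copvqa_logit_pool(q, v, M_I, M_A, W1_M, W2_M, fusion=lambda i, v: i * v):
    # Stage 1 -- interpreting:
    i_mul = fusion(M_I(q), v)   # multimodal interpretation via Cop1 experts
    i_q = W1_M(q)               # unimodal interpretation via the monolithic W1_M
    # Stage 2 -- answering: the pool of outputs Z_jk from the four flows
    return {
        "Z11": M_A(i_mul),    # full causal flow (both cognition layers)
        "Z10": W2_M(i_mul),   # monolithic answering, Cop1 only
        "Z01": M_A(i_q),      # unimodal interpretation, Cop2 answering
        "Z00": W2_M(i_q),     # unimodal and monolithic throughout
    }
```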

Training and Inference Time
Output finalizing: As discussed in Section 4.2, CopVQA emphasizes the impact of the full causal reasoning flow and eliminates the impacts of the incomplete reasoning flows. Inspired by Niu et al. [2021], as introduced in Section 3, we design strategies for output finalizing, the loss function, and answer finalizing at inference time in Equations 7, 8, and 9, respectively, where a is the target answer and the LossFn function is inherited from the particular baseline.
L_total = LossFn(logσ(Z_11 + Z_10 + Z_01 + Z_00), a) + LossFn(logσ(Z_00), a)

Model architecture: Experts in the Mixture of Experts (MoE) framework share a common architecture but carry distinct parameters. Given a layer l : R^in → R^out with hidden size h in a particular baseline that is in charge of the interpreting or answering stage, we design l′ : R^in → R^out with hidden size h′. Individual experts and the monolithic models in CopVQA then share the same architecture as l′. Precisely, h′ is tuned to achieve the optimal result. Through experimentation, we have found that choosing h′ such that the total number of parameters across all experts and monolithic models does not exceed the number of parameters in l tends to yield optimal results. This observation regarding the adjustment of h′ aligns with the principles of knowledge modularity, discussed in more detail in Appendix C. As a result, CopVQA does not increase the parameter count beyond that of the baseline model.
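The training objective defined at the start of this subsection can be sketched directly (NumPy; cross-entropy is used here as a stand-in for the baseline-specific LossFn):

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable elementwise log(sigma(z)).
    return -np.logaddexp(0.0, -z)

def loss_fn(scores, a):
    # Stand-in LossFn: cross-entropy of a softmax over the fused scores.
    return float(-(scores - np.logaddexp.reduce(scores))[a])

def copvqa_total_loss(Z11, Z10, Z01, Z00, a):
    # L_total = LossFn(log sigma(Z11+Z10+Z01+Z00), a) + LossFn(log sigma(Z00), a)
    return (loss_fn(log_sigmoid(Z11 + Z10 + Z01 + Z00), a)
            + loss_fn(log_sigmoid(Z00), a))
```

The second term trains the most biased flow, Z_00, in its own branch, mirroring the question-only term in the CFVQA loss of Section 3.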

Experiment Setup
We conducted experiments to validate: H_1 - causal reasoning in multimodal prediction benefits VQA performance (Section 6.1); H_2 - causal reasoning in multimodal prediction enhances OOD generalization (Section 6.1); and H_3 - the disentangled architecture is crucial for reasoning (Section 6.2).
To examine H 2 , we investigate results on the VQA-CPv2 dataset, a valuable benchmark for assessing OOD generalization in VQA.This dataset features substantial variations in answer distribution per question category between the training and test sets, making it an ideal choice for evaluating the models' ability in OOD scenarios.

Baselines
We implement and compare CopVQA to baselines that do not emphasize causal reasoning in multimodal prediction. On VQA-CPv2 and VQAv2, we implement CopVQA on the CFVQA [Niu et al., 2021] and DVQA [Wen et al., 2021] baselines, which attempt to eliminate language priors. Besides, to complete the comparison, we compare CopVQA to an approach that strengthens visual processing, SCR [Wu and Mooney, 2019], and one that balances the answer distribution, Mutant [Gokhale et al., 2020]. On the medical datasets, we apply CopVQA to PathVQA's SOTA, MMQ [Binh D. Nguyen, 2019], which leverages significant features via meta-annotation.

Implementation details
We fine-tune N_1 and N_2 for each baseline and otherwise maintain consistent configurations, including the optimizer, number of epochs, and batch sizes. For the DVQA baseline, the best setting for the (N_1, N_2) pair is (3, 3). In contrast, for the CFVQA and MMQ baselines, the optimal pair is (5, 5).
CFVQA employs the Block fusion from Ben-Younes et al. [2019] as Fusion and trains with the Cross-Entropy loss. Based on UpDn [Anderson et al., 2017], DVQA designs Fusion as the multiplication of the pre-processed v and q and uses Binary Cross-Entropy as LossFn. The baseline MMQ, based on BAN [Kim et al., 2018], designs Fusion as the recurrent multiplication of the processed v, q, and BAN's result on v and q, and utilizes the Binary Cross-Entropy loss.

Quantitative Results
Overall, CopVQA outperforms all baselines in both iid and OOD settings. Specifically, Table 1 presents results on VQAv2 and VQA-CPv2, and Table 2 shows results on PathVQA and VQA-RAD. Baselines marked with "†" report results from the original paper, while the others are our reproduced mean and standard error over 5 random seeds. The "Gap" column shows the improvement of CopVQA over the reproduced baseline, with "-" for unreported results.

In addition, CopVQA is comparable to MMBS [Si et al., 2022], the current SOTA on VQA-CPv2 and VQAv2, with only one-fourth of the model size (details in Table 4 in the Appendix). Specifically, MMBS attains 68.39% and 69.43% Overall scores on VQA-CPv2 and VQAv2, respectively, with over 200M parameters, while DVQA+CopVQA marks 67.9% and 67.5% with fewer than 52M parameters. Lastly, CopVQA scores 1.9 points lower than CLIP [Eslami et al., 2021], the current VQA-RAD SOTA.
Compared to other approaches on VQA-CPv2 and VQAv2, Table 1 indicates that CopVQA is significantly better than SCR. Moreover, CopVQA achieves comparable or even higher accuracy than Mutant without any data augmentation.

Qualitative Results
We investigate H_3 via the underlying behavior of CopVQA on VQA-CPv2 and VQA-RAD. We denote CopVQA based on DVQA and CFVQA on VQA-CPv2 as CopVQA(D) and CopVQA(C), respectively, and based on MMQ on VQA-RAD as CopVQA(M).
Discussion on the debiased samples: Figure 4 shows that CopVQA has stronger debiasing ability than CFVQA, as it removes the strongly biased answers "2" in Figure 4a and "right" in Figure 4b.

Ablation Studies
Individual effects in CopVQA: We conduct ablations to observe the effect of the components in Equation 7. The analysis confirms the dominance of CopVQA over the modified versions. Figure 6 compares the Overall score of Z_final to those of: (1) Z_11, Z_10, and Z_U (the sum of Z_01 and Z_00). Values of Z_10 are significantly lower than those of Z_final, supporting the assumption that disentangled architectures benefit VQA. (2) !Z_11, !Z_10, and !Z_U, which denote the results of Z_final when Z_11, Z_10, or Z_U is excluded from Z_final at inference, respectively. The gaps of !Z_10, !Z_01, and !Z_00 from Z_final indicate the importance of the monolithic and unimodal paths. (3) Z_final^rand, the score of Z_final when experts are randomly selected at inference. Z_final^rand drops dramatically, proving the vital role of selecting a proper pair of experts for accuracy. (4) Z_final^X, the score of Z_final when we break the causal structure in Figure 2 by incorporating v in computing Z_01, i.e., designing Z_01^X = Cop_2(Fusion(W^1_M(q), v)). Z_final^X is comparable to but does not exceed Z_final in any model, proving the pivotal role of the original causal structure design and the balance of the monolithic and unimodal paths. We collect the scores from the best checkpoint for CopVQA(M) and Z_final^rand, and from the best score over epochs for CopVQA(D) and CopVQA(C).
Number of experts in Cop_1 and Cop_2: Figure 7 shows results for multiple (N_1, N_2) pairs. We observe that values in the range of 3 to 5 experts tend to yield higher scores, with pairs of equal values achieving the highest scores, while pairs with values of 2 or 10 result in significantly lower scores. This finding suggests that an appropriate balance between the experts in the two sets of pathways is crucial for high performance.

Conclusion
We proposed CopVQA, a novel framework that improves OOD generalization by leveraging causal reasoning in VQA. CopVQA emphasizes the answer produced by the full reasoning flow, governed by a disentangled knowledge space and a cognition-enabled component in both the interpreting and answering stages, while eliminating answers from incomplete reasoning flows, which involve unimodal input or monolithic procedures. CopVQA outperforms baselines across domains in both iid and OOD settings. Notably, CopVQA achieves a new SOTA on PathVQA and results comparable to the current SOTAs on the other datasets with significantly fewer parameters.

Limitations
Despite serving as a robust backbone that enhances generalization in the VQA task through its emphasis on causal reasoning in multimodal processing, CopVQA exhibits certain limitations that should be acknowledged:
• Sensitivity to data with limited occurrences, such as brand names or country names, which poses challenges for effective debiasing. We discuss this point further in Appendix B.2.
• A need for careful fine-tuning for each baseline and dataset, involving hyperparameters such as N_1, N_2, and the experts' architecture, to achieve optimal performance.

A Further discussion on experiment details

A.1 Datasets

VQA-CPv2 and VQAv2 are two popular datasets in the VQA domain. We follow the standardized download and input-preprocessing process of CFVQA for CopVQA(C) and of DVQA for CopVQA(D). This ensures consistency and comparability across the different approaches. Similarly, for PathVQA and VQA-RAD from the MMQ baseline, the same guidelines are followed to ensure a standardized experimental setup. In addition, Table 3 indicates the training, validation, and test splits of each dataset.

A.2 Validation along epochs
In Figure 8, the validation results indicate that the proposed CopVQA outperforms the baseline models DVQA and CFVQA by a significant margin.
The performance of CopVQA is noticeably better, showcasing its effectiveness in visual question answering. Additionally, when compared to the MMQ baseline, CopVQA demonstrates comparable performance with a slight improvement. These findings highlight the superiority of CopVQA in addressing the VQA task, surpassing existing baseline models and exhibiting promising potential in the field.

A.3 The role of disentangling ability
To verify the role of the experts' disentangling ability, we conduct an ablation that adjusts the number of experts activated by G in both cognitive pathways, during both training and inference. Particularly, we modify G to produce a k-hot-like instead of a 1-hot-like probability. The comparison in Figure 9 demonstrates the dominant performance of CopVQA with only one expert activated per set of pathways. Increasing k to 2 or 3 significantly drops VQA accuracy, with a broader range of errors. This analysis confirms the essential role of the disentangling ability of each expert, which learns distinct knowledge, in enhancing the performance and preserving the robustness of CopVQA.
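The ablation's k-hot gate can be obtained by keeping the top-k noisy gate scores instead of only the argmax (an illustrative NumPy sketch of our reading of the modification; the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_khot(logits, k=1):
    # Generalizes the 1-hot Gumbel-max selection: perturb the gate logits
    # with Gumbel noise, then activate the k highest-scoring experts.
    gumbel = -np.log(-np.log(rng.random(logits.shape)))
    mask = np.zeros_like(logits)
    mask[np.argsort(logits + gumbel)[-k:]] = 1.0
    return mask
```

With k = 1 this reduces to the original CC behavior; k = 2 or 3 reproduces the ablation settings in which accuracy drops.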

B Further discussion on qualitative analysis

B.1 Debiased samples
Figure 10 presents cases where CopVQA correctly answers questions that other baselines fail to answer correctly. The primary reason for this success is CopVQA's improved understanding of the question's purpose. It effectively addresses biases and avoids potential pitfalls that lead to incorrect answers. Additionally, in VQA-CPv2, CopVQA takes into consideration the relationships between related answers, which further enhances its accuracy and robustness. For instance, the answers "outdoor" and "indoor" have a high probability in the third sample, while the baselines consider "yes" and "no" as other potential answers. Overall, the qualitative analysis demonstrates the effectiveness of CopVQA in mitigating biases and improving the performance of VQA systems.
Figure 11 shows cases where all models' predictions are correct. In the majority of these cases, CopVQA assigns a higher probability to the correct answer than the baselines. The higher probabilities reflect its ability to capture important visual cues, contextual information, and semantic relationships in disentangled architectures, thereby outperforming the baseline models in terms of answer quality. This analysis highlights the superior performance of CopVQA in producing reliable and accurate answers in VQA.

Figure 12 shows cases where all models' predictions are incorrect. A significant portion of these cases falls into two categories: complex counting tasks and ambiguous answers. Complex counting tasks often involve intricate arrangements or a large number of objects, which pose challenges for the models in accurately counting and identifying the objects. The inherent complexity of these tasks can lead to errors in the predictions of all models, including CopVQA. Additionally, cases with ambiguous answers present difficulties, as the correct answer may vary depending on interpretation or subjective judgment. For example, the second sample can be categorized as both a complex counting task and an ambiguous answer, where (1) the image includes many people in dark clothes against a dark background, (2) the input is a duplication of 4 image pieces, and (3) the target answer likely relies on only 1 image piece. In such instances, all models struggle to provide accurate responses, resulting in incorrect predictions across the board. These findings emphasize the existing limitations and challenges of VQA systems when it comes to intricate counting tasks and ambiguous answers.

B.2 Answers Distribution
Figure 13 indicates the analysis of answer distributions in VQA across the train set, test set, and predictions, in cases where all models successfully overcome the biases. Firstly, it is evident that biases exist in the VQA-CPv2 dataset, as the answer distributions in the train and test sets differ significantly. Despite these challenges, all models demonstrate successful debiasing. Notably, CopVQA outperforms the baseline models by producing a predicted answer distribution that aligns more closely with the test-set ground truth, indicating a notable improvement in its ability to generate likely answers.
In Figure 14, which shows cases where all models fail to perform well and biased answers persist in the predictions, several observations can be made. Firstly, strong biases are present, indicated by the stark disparity between the answer distributions in the train and test sets. This implies that the models have not effectively generalized from the training data to handle the biases in the test set. Additionally, the models may struggle to provide accurate answers for categories with limited occurrences, such as specific brand names that appear only a few times in the dataset. These challenges highlight the need for further improvement in handling biases and addressing rare answer categories to enhance the overall performance of VQA models.

C Design principles inspired from Knowledge modularity
In the context of machine learning, modularization refers to the decomposition of complex systems or models into smaller, more manageable modules [Kahneman, 2011, Lindauer et al., 2019, Zhang et al., 2020, Kwon et al., 2019, Sayyed and Kulkarni, 2021, Garibaldi, 2021, Goyal and Bengio, 2022]. In the realm of knowledge modularity and machine learning, this principle suggests that breaking down complex tasks or models into modular components can lead to more effective and efficient learning. The advantage of this modular approach is that it allows for the development of specialized modules that can be individually optimized (learning independently via the Gumbel-max activation in CopVQA), leading to improved performance on specific subtasks. It also promotes reusability, as modular components can be shared or combined to address related tasks or domains. Furthermore, modularization in machine learning facilitates interpretability and explainability. Since each module focuses on a specific aspect of the task, it becomes easier to understand and analyze the contributions of each module to the overall decision-making process. This transparency is particularly valuable in domains where interpretability is crucial, such as healthcare or autonomous systems.

Figure 1 :
Figure 1: The conventional causal-effect view of the VQA task and the proposed Cognitive pathways VQA, which enhances the causal reasoning aspect; the causal graphs of the four kinds of reasoning flows and the corresponding computational flows in CopVQA; visualization of the interpreting-answering processes through the monolithic and disentangled knowledge spaces in computing Z_00 and Z_11.

Figure 2 :
Figure 2: CopVQA from the causal-effect view aligned with the computational flows. CopVQA emphasizes the full causal reasoning flow (green) while eliminating effects from the monolithic (blue) and unimodal (red) flows.

Figure 3 :
Figure 3: Samples of debiased cases, listing the labels with the highest probability. The green and red labels are the correct and incorrect answers, respectively.

Figure 4 :
Figure 4: Answer distributions on VQA-CPv2 and comparison of debiasing ability. CopVQA produces distributions most similar to those of the test set.

Figure 5 :
Figure 5: The proportion of expert selections from Cop_1 and Cop_2 during inference.

Figure 6 :
Figure 6: Ablation studies clarifying the individual effect of components in the CopVQA design. The original CopVQA design demonstrates dominant performance over all variants.
Appendix B discusses more of the models' success and failure cases in the debiased answer distributions. Discussion on mechanism selection: Figure 5 presents the proportion of expert-pair selections from Cop_1 and Cop_2 over the N_1 × N_2 combinations. CopVQA consistently assigns non-overlapping experts to separate answer categories. For instance, in CopVQA(D), Cop_1 assigns experts 2, 1, and 3 to the Y/N, Number, and Other types, respectively, in most cases. This pattern aligns with the strategy of CopVQA, where experts specialize in distinct contexts of the multimodal input and the commonsense knowledge space. Moreover, we observe that CopVQA allots more than one answering expert to the Number type by a visible proportion, adapting to the diverse skills required and explaining CopVQA's significant improvement on this type.

Figure 7 :
Figure 7: Overall accuracy comparison over various configurations of N 1 and N 2 in CopVQA.

Figure 8 :
Figure 8: Comparison of validation results during the training process.

Figure 9 :
Figure 9: Ablation study on the essential role of the disentangling ability of experts.

Figure 10 :
Figure 10: Examples of cases effectively debiased by CopVQA over the baselines. The green labels are the correct answers, while the red labels are the incorrect answers.

Figure 11 :
Figure 11: Examples of cases where all models successfully answer the questions. The green labels are the correct answers, while the red labels are the incorrect answers.

Figure 12 :
Figure 12: Examples of cases where all models fail to remove biases. The green labels are the correct answers, while the red labels are the incorrect answers.

Figure 13 :
Figure 13: Answer distributions on VQA-CPv2 in the cases where all models are successful.

Table 2 :
Accuracy comparison on the PathVQA and VQA-RAD datasets. The best scores are bolded.

Table 4 in the Appendix compares the model sizes and training times. Proving H_1, CopVQA shows consistent improvement in all answer categories regardless of baseline and domain. Notably, on VQAv2, CopVQA achieves Overall scores +1.5% and +3.2% higher than the CFVQA and DVQA baselines, respectively. On the medical datasets, CopVQA improves medical VQA ability by +2.3 and +3.4 points over the MMQ baseline. Proving H_2, regardless of the base model, CopVQA demonstrates the effectiveness of causal reasoning for generalization by outperforming the baselines on VQA-CPv2 by a large margin. Specifically, CopVQA acquires remarkable Overall improvements of +2.8% when CFVQA-based and +6.8% when DVQA-based. Notably, CopVQA shows exceptional performance on the Number type, with gains of +20.3 and +12.7 points over the CFVQA and DVQA baselines, respectively. Compared to current SOTAs, CopVQA achieves a new SOTA on PathVQA, +2.3 points higher than MMQ, the previous SOTA.

Table 3 :
Dataset statistics. "*" denotes a summary over the train set only; "-" marks unreported information; "K" stands for one thousand.

Table 4 :
Comparison of model size and proportion of average training duration over 5 runs. All models are trained on the same single NVIDIA RTX A6000.
