Compressing and Debiasing Vision-Language Pre-Trained Models for Visual Question Answering

Despite the excellent performance of vision-language pre-trained models (VLPs) on the conventional VQA task, they still suffer from two problems: First, VLPs tend to rely on language biases in datasets and fail to generalize to out-of-distribution (OOD) data. Second, they are inefficient in terms of memory footprint and computation. Although promising progress has been made on both problems, most existing works tackle them independently. To facilitate the application of VLPs to VQA tasks, it is imperative to jointly study VLP compression and OOD robustness, which, however, has not yet been explored. This paper investigates whether a VLP can be compressed and debiased simultaneously by searching for sparse and robust subnetworks. To this end, we systematically study the design of a training and compression pipeline to search for the subnetworks, as well as the assignment of sparsity to different modality-specific modules. Our experiments involve 3 VLPs, 2 compression methods, 4 training methods, 2 datasets and a range of sparsity levels. Our results show that there indeed exist sparse and robust subnetworks, which are competitive with the debiased full VLP and clearly outperform the debiasing SoTAs with fewer parameters on the OOD datasets VQA-CP v2 and VQA-VS.


Introduction
Visual Question Answering (VQA) (Antol et al., 2015) is an important task at the intersection of CV and NLP. In the last decade, deep neural networks have made promising progress in VQA. However, recent studies (Agrawal et al., 2016; Manjunatha et al., 2019) have found that VQA models are prone to dataset biases. As a result, they often suffer sharp performance drops when faced with out-of-distribution (OOD) test sets, whose answer distributions differ from that of the training set. The code can be found at https://github.com/PhoebusSi/Compress-Robust-VQA.
Although large-scale vision-language pre-trained models (VLPs) achieve further improvements on the in-distribution (ID) VQA benchmark (Goyal et al., 2017), they also fail to address the dataset-bias problem (Agrawal et al., 2018); e.g., lxmert (Tan and Bansal, 2019) suffers a 23.26% drop between ID and OOD accuracy. At the same time, the improvement brought by VLPs is partly due to their large model size, which increases the computational cost of deploying VQA models. To facilitate the application of VLPs to VQA tasks, the two problems should be addressed simultaneously. However, existing research mostly focuses on each of them separately.
The dataset-bias problem in VQA is well studied by numerous debiasing methods built on conventional small-scale models (Anderson et al., 2018; Cadene et al., 2019). Their main solution (Cadene et al., 2019; Clark et al., 2019; Liang et al., 2021b; Mahabadi and Henderson, 2019) is to regularize the loss according to the bias degree of training samples. As for the increased computational cost, a line of recent efforts has been made to compress pre-trained language models (PLMs) in the NLP field (Chen et al., 2020b; Li et al., 2020a,b; Liang et al., 2021a; Liu et al., 2021, 2022; Prasanna et al., 2020) and VLPs for visual-linguistic tasks (Fang et al., 2021; Gan et al., 2022). They show that large-scale PLMs and VLPs can be compressed into lightweight models without degrading performance. Refer to App. A for more related work.
This paper jointly studies the compression and debiasing problems of VLPs for the VQA task. To this end, we combine existing debiasing and pruning methods to establish a training and compression pipeline, and conduct extensive experiments with the pre-trained lxmert, the most popular VLP in VQA, under different OOD settings. We show that there exist sparse lxmert subnetworks that are more robust than the full model, which suggests that the goals of OOD robustness and computational efficiency can be achieved simultaneously.
We also present a comprehensive study on the design of the training and compression pipeline, as well as the assignment of sparsity to different model modules, to identify subnetworks with better OOD generalization. Our findings highlight the importance of 1) employing a two-stage training and compression pipeline and integrating the debiasing objective throughout the entire process; 2) if two debiasing methods work well with the full model, training the full model with the relatively poor-performing one and compressing it with the better one; 3) assigning modality-specific sparsity to different modules of the VLP.
Our main contributions are as follows: (1) We present the first (to our knowledge) systematic study on sparsity and OOD robustness for VLPs.
(2) Our empirical studies on the training and compression pipeline and sparsity assignment can serve as a valuable guideline for the future design of VLP subnetwork searching methods. (3) We obtain subnetworks that outperform existing debiasing SoTAs in terms of the trade-off between accuracy and model size on the OOD datasets VQA-CP v2 and VQA-VS (see Fig. 1, Tab. 1 and Tab. 2).

Method

VLP Architecture and Subnetworks
This section takes lxmert as an example to introduce how we extract subnetworks. Lxmert contains an embedding layer, a visual fc layer, a pooler layer, a VQA-specific classifier and a stack of Transformer layers, which involve three encoders: the language encoder (L_enc), the object relationship encoder (R_enc) and the cross-modality encoder (C_enc).
We adopt unstructured pruning to obtain a compressed version (i.e., a subnetwork) of the original VLPs. Specifically, given a VLP f(θ) with parameters θ, we apply a binary pruning mask m ∈ {0, 1}^|θ| to the model parameters, which gives rise to f(m ⊙ θ), where ⊙ is the element-wise product. The parameters to be pruned are:

θ_pr = {W_emb, W_vis-fc, W_plr} ∪ θ_Lenc ∪ θ_Renc ∪ θ_Cenc,

where W_emb, W_vis-fc and W_plr are the weights of the embedding layer, visual fc layer and pooler layer, and θ_Lenc ∪ θ_Renc ∪ θ_Cenc are the parameters of the Transformer layers. More details of lxmert can be found in App. B.1. Another model used in our experiments, visualBERT (Li et al., 2019), is introduced in App. B.2.
Magnitude-based Pruning approximates the importance of model parameters by their absolute values and eliminates the less important ones. We adopt the basic version of magnitude-based pruning, i.e., one-shot magnitude pruning (OMP). OMP can optionally be combined with further fine-tuning of the pruned subnetwork to recover the performance drop.
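As a concrete illustration, OMP can be sketched in a few lines. This is a hypothetical, framework-free version (the actual implementation operates on lxmert's weight tensors, not flat Python lists):

```python
def omp_mask(weights, sparsity):
    """One-shot magnitude pruning: zero out the `sparsity` fraction
    of weights with the smallest absolute values."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by magnitude, smallest first; the head gets pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]

w = [0.5, -0.02, 1.3, 0.001, -0.7, 0.09]
mask = omp_mask(w, sparsity=0.5)
print(mask)  # -> [1, 0, 1, 0, 1, 0]: the three largest-|w| entries survive
```

The surviving subnetwork is then f(m ⊙ θ), and the optional fine-tuning step simply continues training with the mask held fixed.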
Mask Training directly optimizes the binary pruning mask m towards the given objectives. Specifically, each weight matrix W ∈ R^{d_i×d_o} is associated with two mask matrices: a binary mask m ∈ {0, 1}^{d_i×d_o} and a real-valued mask m̂ ∈ R^{d_i×d_o}. In the forward propagation, m is computed from m̂ through binarization:

m_{i,j} = 1 if m̂_{i,j} ≥ ϕ, and m_{i,j} = 0 otherwise,

where ϕ is the threshold. Then, the original weight matrix W is replaced with a pruned one, m ⊙ W.
When it comes to backward propagation, we follow (Liu et al., 2022; Mallya et al., 2018; Radiya-Dixit and Wang, 2020; Zhao et al., 2020) and use the straight-through estimator (Bengio et al., 2013) to estimate the gradients of m̂ using the gradients of m, and then update m̂ as m̂ ← m̂ − η ∂L/∂m, where η is the learning rate.
We initialize m̂ according to the magnitudes of the pre-trained weights of lxmert. This strategy has been shown to be more effective than random initialization for pre-trained language models (Liu et al., 2022; Radiya-Dixit and Wang, 2020), and we also validate this in our experiments with lxmert (see App. C.2). Specifically, m̂ is initialized as:

m̂_{i,j} = α × ϕ if W_{i,j} survives OMP at the target sparsity (i.e., |W_{i,j}| is among the largest magnitudes in W), and m̂_{i,j} = 0 otherwise,

where α ≥ 1 is a hyper-parameter. At initialization, we set the threshold ϕ = 0.01 (any other value of the same order of magnitude should also be fine).
To ensure that the subnetwork satisfies the given sparsity, ϕ is re-computed every t_m training steps.
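A minimal sketch of this procedure, with hypothetical names and scalar Python lists in place of weight matrices: the forward pass thresholds the real-valued mask, the backward pass passes the binary mask's gradient straight through, and ϕ is periodically re-picked so the target sparsity holds.

```python
PHI = 0.01  # initial threshold

def binarize(real_mask, phi):
    # forward: m = 1 where the real-valued mask reaches the threshold
    return [1 if v >= phi else 0 for v in real_mask]

def ste_step(real_mask, grad_of_binary_mask, lr):
    # backward: straight-through estimator -- update the real-valued
    # mask with the gradient computed w.r.t. the binary mask
    return [v - lr * g for v, g in zip(real_mask, grad_of_binary_mask)]

def recompute_phi(real_mask, sparsity):
    # every t_m steps: re-pick phi so the target sparsity is satisfied
    return sorted(real_mask)[int(len(real_mask) * sparsity)]

real = [0.020, 0.005, 0.011, 0.000]
print(binarize(real, PHI))            # -> [1, 0, 1, 0]
real = ste_step(real, [0.5, -0.8, 0.0, -1.2], lr=0.01)
phi = recompute_phi(real, sparsity=0.5)
print(binarize(real, phi))            # exactly half the entries survive
```

Note that only the masks are updated; the pre-trained weights stay frozen, which is why mask training can reveal subnetworks already present in the fine-tuned model.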

Debiasing Methods
The debiasing methods in VQA usually consist of a main model and a biased model. The biased model, which learns the language bias, is used to measure the bias degree of training samples and adjust the training loss of the main model. We experiment with the SoTA debiasing methods LMH (Clark et al., 2019), RUBi (Cadene et al., 2019) and LPF (Liang et al., 2021b), of which LMH is widely studied for the OOD scenario of VQA (Chen et al., 2020a; Liang et al., 2020; Si et al., 2021) and NLU (Jia and Liang, 2017; McCoy et al., 2019; Schuster et al., 2019; Zhang et al., 2019). For comparison, we also describe the binary cross-entropy here.

Binary Cross-Entropy (BCE) computes the cross-entropy between the predicted distribution p_m (from the main model) and the soft target score t of each ground-truth answer:

L_bce = −[t · log δ(p_m) + (1 − t) · log(1 − δ(p_m))],

where δ denotes the sigmoid function.
Learned-Mixin +H (LMH) adds a biased model to learn biases during training, combining the two models as:

p̂ = softmax(log p_m + g(h) · log p_b),

where p_b and p_m are the predicted distributions of the biased model and the main model, respectively, and g(h) determines how much to trust the learned biases, based on lxmert's last hidden representation h. Following (Clark et al., 2019), we directly use the answers' frequency under each question type as p_b. To prevent p_b from being ignored, LMH also adds an entropy penalty item R to the final loss (see App. B.3).

RUBi adopts a training strategy similar to LMH to regularize the main model's probability, scaling p_m with a mask derived from the biased model, p̂ = p_m ⊙ δ(p_b), and uses the standard cross-entropy over p̂ as the training loss.

LPF measures the bias degree as α_k = p_b[a_k] to regularize the loss of the main model:

L_lpf = −(1 − α_k)^γ · log p_m[a_k],

where γ is a tunable hyper-parameter.
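To make the learned-mixin ensemble concrete, here is a small numerical sketch in plain Python with illustrative values; the real g(h) is a trained gate and the entropy penalty is omitted:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def learned_mixin(log_pm, log_pb, gate):
    # LMH ensemble: softmax(log p_m + g(h) * log p_b)
    return softmax([lm + gate * lb for lm, lb in zip(log_pm, log_pb)])

log_pm = [math.log(p) for p in (0.5, 0.3, 0.2)]     # main model
log_pb = [math.log(p) for p in (0.90, 0.05, 0.05)]  # biased model

# gate = 0: biases ignored, the ensemble reduces to the main model
print([round(p, 2) for p in learned_mixin(log_pm, log_pb, 0.0)])  # -> [0.5, 0.3, 0.2]
# gate = 1: the biased answer absorbs most of the ensemble's probability
# mass, so the training loss pushes p_m to explain what the bias cannot
print([round(p, 2) for p in learned_mixin(log_pm, log_pb, 1.0)])
```

During training, the loss is computed on the ensemble p̂, so heavily biased samples contribute little gradient to the main model; at test time only p_m is used.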

Problem Formulation
Given the pre-trained lxmert f(θ_pt), our goal is to find a subnetwork f(m ⊙ θ_ft) that satisfies a target sparsity level s and maximizes the OOD performance:

max_{m, θ_ft} E_OOD(f(m ⊙ θ_ft)), s.t. 1 − ∥m∥_0 / |θ_pr| = s, (9)

where E_OOD denotes OOD evaluation, ∥·∥_0 is the L_0 norm and |θ_pr| is the total number of parameters in θ_pr. This goal is achieved by searching for the optimal m and θ_ft in model training and compression.
Eq. 9 only specifies the overall sparsity. In this work, we also explore a finer-grained control over sparsity, which allocates different sparsity to different modules of lxmert while the overall sparsity is kept. Concretely, we consider three modules from different modalities, i.e., the language module, the visual module and the cross-modality module. The constraint in the optimization problem is then rewritten as:

1 − ∥m_Lan∥_0 / |θ_Lan| = s_L, 1 − ∥m_Vis∥_0 / |θ_Vis| = s_R, 1 − ∥m_X∥_0 / |θ_X| = s_X, (10)

where θ_Lan, θ_Vis and θ_X are the parameters of the language module, visual module and cross-modality module, respectively; m_Lan, m_Vis and m_X are the binary masks for the three modules; and s_L, s_R and s_X are their target sparsity levels.
If not otherwise specified, we set the sparsity of every weight matrix to the target sparsity. For example, if s = 70% and there is no modality-specific constraint, then all weight matrices are at 70% sparsity (uniform sparsity). If s_L = 50%, then all weight matrices in θ_Lan are at 50% sparsity, while s_R and s_X could be different (modality-specific sparsity).
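The bookkeeping behind modality-specific sparsity can be sketched as follows, using the per-module parameter counts from App. B.4 (values in millions; the helper and the example assignment are ours, for illustration only):

```python
# parameters per module, in millions (App. B.4)
SIZES = {"lang": 83.1, "vis": 35.3, "cross": 78.8}

def overall_sparsity(per_module):
    """Overall sparsity implied by per-module targets (s_L, s_R, s_X):
    a parameter-count-weighted average."""
    total = sum(SIZES.values())
    pruned = sum(SIZES[k] * per_module[k] for k in SIZES)
    return pruned / total

uniform = {"lang": 0.7, "vis": 0.7, "cross": 0.7}
# hypothetical modality-specific assignment: prune the language and
# cross-modality modules harder, spare the visual module
specific = {"lang": 0.80, "vis": 0.30, "cross": 0.80}

print(round(overall_sparsity(uniform), 3))   # -> 0.7
print(round(overall_sparsity(specific), 3))  # close to 0.7 overall
```

Because the visual module is the smallest, lowering its sparsity costs relatively few extra parameters, which is what makes the non-uniform assignments explored later affordable.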

Training and Compression Pipeline
We define two notations: F_L(f(θ)) denotes training f(θ) using loss L ∈ {L_bce, L_lmh}. P^p_L(f(θ)) denotes pruning f(θ) using method p ∈ {OMP, mask train} and loss L (if applicable), which outputs a pruning mask m. A typical training and compression pipeline involves three stages:

Stage1: Full Model Fine-tuning. The pre-trained lxmert f(θ_pt) is fine-tuned using loss L, which produces f(θ_ft) = F_L(f(θ_pt)).
Stage2: Model Compression. The fine-tuned lxmert f(θ_ft) is compressed, and we get the subnetwork f(m ⊙ θ_ft), where m = P^p_L(f(θ_ft)).

Stage3: Further Fine-tuning (optional). The subnetwork f(m ⊙ θ_ft) is further fine-tuned using loss L′, which gives f(m ⊙ θ′_ft) = F_L′(f(m ⊙ θ_ft)).
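In this notation, a pipeline is fully specified by the loss used at each stage and the pruning method. A toy sketch that just records the schedule (names are ours, not the paper's code):

```python
def pipeline(stage1_loss, stage2_loss, stage3_loss=None, method="mask_train"):
    """Assemble a training-and-compression schedule:
    Stage1 F_L1, Stage2 P^p_L2, optional Stage3 F_L3."""
    steps = [
        ("stage1: fine-tune full model", stage1_loss),
        (f"stage2: prune with {method}", stage2_loss),
    ]
    if stage3_loss is not None:
        steps.append(("stage3: further fine-tune subnetwork", stage3_loss))
    return steps

# e.g. a recipe that fine-tunes with LPF, prunes with LMH, and skips Stage3
schedule = pipeline("lpf", "lmh")
for step, loss in schedule:
    print(step, "| loss:", loss)
```

The experiments that follow compare such schedules by varying the losses per stage and whether Stage3 is present at all.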

Experiments
In this section, we mainly investigate three questions: (1) How does compression affect lxmert's OOD generalization ability? (2) How should the training and pruning pipeline be designed to achieve a good sparsity-performance trade-off? (3) How should sparsity be assigned to different modality-specific modules?

Datasets, Model and Implementation
We conduct experiments on the OOD benchmarks VQA-CP v2 (Agrawal et al., 2018) and VQA-VS (Si et al., 2022b), which evaluate the robustness of VQA systems, using the accuracy-based evaluation metric (Antol et al., 2015). A more detailed discussion of the differences between the two datasets is given in Sec. 3.5. We thoroughly study the above three questions on VQA-CP v2, which is widely used in the literature on debiasing VQA systems (refer to Sec. 3.2, 3.3 and 3.4). Then, based on the findings, we further explore the more challenging VQA-VS (Si et al., 2022b) (refer to Sec. 3.5). For the VLP, we adopt the lxmert-base-uncased model (Tan and Bansal, 2019) released by huggingface (Wolf et al., 2020). All results are averaged over 4 random seeds. More information on the model and implementation details is given in App. B.4.

Effect of Compression on OOD Accuracy
Subnetworks from BCE Fine-tuned lxmert. We compress the BCE fine-tuned lxmert using OMP and mask training, and introduce either L_bce or L_lmh in the pruning process (for mask training) or the further fine-tuning process (for OMP).
The results are shown in the upper row of Fig. 2, from which we can derive several observations: 1) When no debiasing method is used, the subnetworks of "mask train(bce)" and "OMP + bce ft" improve over the full lxmert by 1.35% ∼ 2.79%, even at up to 70% sparsity. This implies that lxmert is overparameterized and pruning may remove some parameters related to the bias features. 2) "mask train(lmh)" and "OMP + lmh ft" achieve a further performance boost, exceeding the full lxmert by a large margin (11.05% ∼ 14.02%). Since mask training does not change the values of parameters, the results of "mask train(lmh)" indicate that the biased "full lxmert(bce)" already contains sparse and robust subnetworks (across 10% ∼ 90% sparsity).
3) "mask train" outperforms "OMP" in general, which suggests that directly optimizing the subnetwork structure is more effective than debiasing a compressed subnetwork by further fine-tuning.
Subnetworks from lxmert Fine-tuned with Debiasing Methods. From the lower row of Fig. 2, we can find that: 1) For the full lxmert, the OOD performance is clearly promoted by the LMH debiasing method. 2) Unlike the lxmert(bce) subnetworks, the lxmert(lmh) subnetworks do not exhibit significant improvement over the full model. However, the "mask train(lmh)" and "OMP + lmh ft" subnetworks, which preserve lxmert(lmh)'s performance at up to 50% sparsity, can serve as an efficient alternative to the LMH fine-tuned full lxmert. 3) "mask train(bce)" and "OMP + bce ft" clearly underperform their lmh counterparts, which suggests that it is important to use the debiasing method in pruning and subnetwork further fine-tuning even when the full model is already trained with the debiasing method.

Fig. 3 compares the subnetworks fine-tuned with LMH, LPF and RUBi. We find that the subnetworks found using LMH consistently outperform those found by LPF and RUBi across different sparsity levels. Therefore, to save computing resources, we mainly use the best-performing LMH in the following experiments and analysis.

Training and Compression Pipeline
In this section, we study the proper design of the training and compression pipeline, under the basic framework described in Sec. 2.5. Here we focus on the mask training compression method, as it has been shown to generally outperform OMP with further fine-tuning. Our main observations can be described from three perspectives:

First, it is recommended to introduce the debiasing loss across Stage1, Stage2 and (if applicable) Stage3. The reason is three-fold: 1) As shown by Fig. 4, the subnetworks at 10%, 30% and 70% sparsity levels have better performance when starting from lxmert(lmh), as compared with lxmert(bce). At 90% sparsity, "lxmert(lmh) + mask train(lmh)" underperforms "lxmert(bce) + mask train(lmh)" (see App. C.3 for reasons), but the accuracy gap is small. Therefore, adopting L_lmh in Stage1 is a better choice than L_bce, especially when the subnetworks are not at extremely high sparsity. 2) As discussed in the previous section, introducing L_lmh in the mask training process (Stage2) substantially outperforms L_bce for both lxmert(lmh) and lxmert(bce). 3) When both Stage1 and Stage2 adopt the BCE loss, further fine-tuning the subnetworks with the LMH loss in Stage3 can significantly boost the performance, as shown by the results of "lxmert(bce) + mask train(bce)" w/o ft and w/ lmh ft in Fig. 4.
Second, Stage3 is unnecessary if it adopts the same training objective as Stage2. Comparing the blue and red (or cyan) bars in Fig. 4, we can see that further fine-tuning with the same training objective generally degrades the performance of "lxmert(lmh) + mask train(lmh)", "lxmert(bce) + mask train(lmh)" and "lxmert(bce) + mask train(bce)". This phenomenon suggests that Stage3 can be eliminated to save computation cost.
Third, it is recommended to use different debiasing methods in the two stages and leave the better one to Stage2. As shown in Fig. 5, although LPF and RUBi are less effective than LMH in debiasing the full model, "lpf+lmh" (i.e., "lxmert(lpf) + mask train(lmh)") and "rubi+lmh" are superior to "lmh+lmh". In contrast, when the debiasing methods used in the two stages are reversed, "lmh+rubi" and "lmh+lpf" exhibit worse performance, suggesting that the better debiasing method should be used in Stage2. Additionally, "lpf+lmh" is superior to "rubi+lmh", which indicates that using a better debiasing objective in Stage1 is helpful when we have multiple choices different from the Stage2 objective. We also experiment with another VLP model, visualBERT (Li et al., 2019), and find that "lpf+lmh" still performs best, as shown in Fig. 7.

Modality-specific Sparsity
Pruning Each Single Modality-specific Module.
Since lxmert uses different modules to encode the multi-modal data, it is intuitive to hypothesize that different modules of lxmert may capture the language bias to different extents. To validate this hypothesis, we compress the language, visual and cross-modality modules separately. As presented in Fig. 6, the compression of different modality-specific modules indeed exhibits different effects.
When the full model is lxmert(bce) (the orange and cyan lines), compressing the language or cross-modality module has a positive effect on the OOD performance, and the accuracy generally improves as sparsity increases from 10% to 90%. By contrast, compressing the visual module yields inferior results compared with compressing the other two modules, even though the number of remaining parameters is larger (note that the visual module has fewer parameters than the other two modules). These results suggest that, for the biased lxmert(bce), the language and cross-modality modules capture more training-set bias than the visual module, which supports the above hypothesis.
In terms of "lxmert(lmh) + mask train(lmh)" (the red line), although compression does not lead to performance improvement like compressing lxmert(bce), the results also demonstrate that the language and cross-modality modules are more compressible than the visual module.
Searching for Appropriate Modality-specific Sparsity. Motivated by the above findings, we search for appropriate modality-specific sparsity by performing mask training with a variety of sparsity configurations (see App. C.4) for the three modules while keeping the overall sparsity the same.
As we can see in Fig. 8, at 50% and 70% overall sparsity, the configuration that achieves the best result assigns slightly higher sparsity to the language and cross-modality modules and significantly lower sparsity to the visual module, as compared with uniform sparsity. This phenomenon is in accordance with the findings in Fig. 6, indicating that compressing the three modules uniformly is suboptimal (at 50% ∼ 70% sparsity) and that the language and cross-modality modules should be compressed to a larger extent than the visual module. At 90% sparsity, the comfort zone of the sparsity configuration is in the proximity of the uniform point. Further increasing the sparsity of the language and cross-modality modules results in performance decline or only minor improvements. This is because 90% sparsity already approaches the compression upper bound, even for the language and cross-modality modules.
Fig. 9 shows a more direct comparison between uniform and modality-specific sparsity. We also introduce another baseline, "matrix-specific sparsity", which ranks all the model parameters globally, instead of ranking the parameters within each weight matrix. This also results in different sparsity levels for different weight matrices, but without explicit control over the modality-specific sparsity. We can see that modality-specific sparsity achieves the best results across the three overall sparsity levels from 50% to 90%, demonstrating its superiority. Besides, the results also suggest that, although simply allowing different matrices to have different sparsity is more flexible than uniform sparsity, it is not conducive to the final performance.
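The difference between the two ranking schemes can be seen in a toy example (illustrative lists in place of weight matrices): ranking globally lets low-magnitude matrices be pruned almost entirely, with no explicit per-module control.

```python
def per_matrix_masks(mats, s):
    # uniform sparsity: prune the smallest s fraction inside each matrix
    out = {}
    for name, w in mats.items():
        keep = sorted(range(len(w)), key=lambda i: abs(w[i]))[int(len(w) * s):]
        out[name] = [1 if i in set(keep) else 0 for i in range(len(w))]
    return out

def global_masks(mats, s):
    # matrix-specific sparsity: rank all weights together; per-matrix
    # sparsity then emerges implicitly
    flat = sorted(((abs(v), name, i) for name, w in mats.items()
                   for i, v in enumerate(w)))
    pruned = {(name, i) for _, name, i in flat[:int(len(flat) * s)]}
    return {name: [0 if (name, i) in pruned else 1 for i in range(len(w))]
            for name, w in mats.items()}

mats = {"lang": [0.9, -0.8, 0.7, 0.6], "vis": [0.1, -0.2, 0.3, 0.4]}
print(per_matrix_masks(mats, 0.5))  # each matrix keeps its 2 largest
print(global_masks(mats, 0.5))      # "vis" is pruned away entirely
```

The modality-specific scheme sits between these extremes: it applies per-matrix pruning, but with different target rates per module.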

Exploration on VQA-VS
VQA-CP v2 is widely used in the literature on debiasing VQA systems. However, it only considers the question-type-based bias. To account for other potential biases, VQA-VS constructs several types of OOD test sets according to different shortcuts (e.g., keyword and key object). As a result, VQA-VS is more challenging and allows us to analyze the results over different biases. In this section, we search for sparse and robust lxmert subnetworks in VQA-VS based on the major findings obtained from VQA-CP v2.
The Effect of Compression. Fig. 10 shows the results of the full lxmert and subnetworks on VQA-VS. We can see that: 1) When using the BCE objective, we can identify sparse "bce+bce" subnetworks that are comparable with the full lxmert (bce). 2) Different from VQA-CP v2, the full lxmert (lmh) only slightly outperforms the full lxmert (bce) in the OOD setting of VQA-VS, and underperforms it in the ID setting.
3) The "lmh+lmh" subnetworks improve over the full lxmert (lmh) on both ID and OOD test sets, across a wide range of sparsity levels, suggesting that lxmert can also be simultaneously compressed and debiased on VQA-VS.
The Effect of Modality-specific Sparsity. Fig. 10 also shows that compressing different modality-specific modules has different effects on VQA-VS, as on VQA-CP v2. The language module is the most compressible, while compressing the visual module results in the sharpest performance decline.
To compare modality-specific sparsity and uniform sparsity, we directly inherit the sparsity configuration selected in Sec. 3.4 on VQA-CP v2. Fig. 11 shows that modality-specific sparsity consistently outperforms uniform sparsity, except for 90% sparsity in the ID setting.

Comparison with Debiasing SoTAs
In this section, we compare the best training and compression solutions identified in the previous sections with the current SoTA debiasing methods.
Tab. 1 shows the results on VQA-CP v2. We find that the accuracy of our methods (10% lxmert and 30% lxmert) beats the previous non-VLP debiasing SoTAs. (Since most debiasing methods, e.g., LPF and RUBi, fail on VQA-VS (see Tab. 2), we only use LMH on VQA-VS. However, combining LMH with other effective debiasing methods in different stages may further outperform "lmh+lmh", as found on VQA-CP v2; we leave this for future work.) We also add experiments on a more recent VLP, mPLUG (Li et al., 2022). We adopt the base version of mPLUG, fine-tune it on the VQA-CP v2 training set and then prune it using mask training. Since mPLUG formulates VQA as a text-generation task, we adopt the LPF debiasing method; LMH and RUBi cannot be directly applied to debias text-generation models, because they are designed for classification losses over a fixed number of classes. As shown in the bottom rows of Tab. 1, the mPLUG trained with standard cross-entropy (CE) loss can be simultaneously compressed (to 50%) and debiased (+5.48 Acc). The mPLUG trained with the LPF debiasing loss can also be compressed to 50% with only a slight accuracy decline. These results demonstrate that the findings and techniques presented in our work generalize to more advanced VLPs.
Results on VQA-VS are presented in Tab. 2. We can observe that: 1) Our methods, "bce+bce" 10% lxmert and "lmh+lmh" 30% lxmert, outperform all the non-VLP debiasing methods in both ID and OOD settings, with similar or fewer parameters. 2) Except for LMH, the other debiasing methods underperform BCE on OOD-mean. LMH improves the OOD accuracy at the cost of an ID accuracy decline. 3) The "lmh+lmh" subnetworks (even with 50% remaining parameters) clearly improve the ID performance of lxmert (lmh) and retain comparable OOD performance. 4) Compared with "bce+bce", the OOD advantage of "lmh+lmh" outweighs its ID disadvantage at 50% to 90% remaining parameters. With fewer remaining parameters, the overall performance of "bce+bce" is superior.

Conclusion
To facilitate the application of VLP-based VQA systems, this paper presents the first joint study on the compression and debiasing problems of VLPs for the VQA task. Through extensive experiments with three VLPs (i.e., lxmert, visualBERT and mPLUG), we analyze the impact of compression on the OOD generalization ability. We present a comprehensive study on the design of the training and compression pipeline for a good sparsity-performance trade-off, and provide valuable findings about the assignment of sparsity to different modality-specific modules. The compressed lxmert subnetworks in this paper outperform the SoTA debiasing methods with fewer or similar model parameter counts.

Limitations
Although we have empirically verified that adopting modality-specific sparsity is beneficial for the search for more robust subnetworks, our work does not yet provide a solution for determining the optimal sparsity assignment effectively and efficiently. We invite follow-up studies to address this in future work.

A More Related Work
A.1 Overcoming Dataset Bias in VQA

Most VQA systems heavily rely on the information in the question to predict answers, regardless of the content of the given image; that is, they learn the language biases in datasets. They are not robust and often perform poorly in OOD settings, where the language biases learned from the training set do not hold on the test set. To promote the development of models that overcome this problem, VQA-CP v2 (Agrawal et al., 2018) was proposed and has become the standard OOD benchmark in VQA. Currently, the widely used debiasing methods can be roughly grouped into non-data-augmentation (Clark et al., 2019; Liang et al., 2021b; Mahabadi and Henderson, 2019) and data-augmentation methods (Chen et al., 2020a; Gokhale et al., 2020). The former apply a biased model (trained with the question only) to regularize the training of the main model and thus prevent it from learning from the question alone. The latter generate samples to balance the training data and directly erase the biases in the training set. However, the augmented data also increase the training cost, and overcoming the language-bias problem while leaving the original training data unchanged remains a major challenge (Liang et al., 2021b; Niu et al., 2021). Thus, we focus on non-data-augmentation methods, such as LMH (Clark et al., 2019), RUBi (Cadene et al., 2019) and LPF (Liang et al., 2021b). Very recently, VQA-VS (Si et al., 2022b) was proposed to explore varying types of dataset biases. We also use this dataset to study how the training and compression pipeline affects different dataset biases.

A.2 Vision-Language Pre-trained Models
Recently, VLPs (Dou et al., 2022; Li et al., 2020a, 2021; Wang et al., 2021a,b; Zhang et al., 2021; Si et al., 2023; Li et al., 2022) based on the Transformer backbone (Vaswani et al., 2017) have achieved encouraging success. In particular, OFA (Wang et al., 2022) and Florence (Yuan et al., 2021) establish the SoTA on the in-distribution VQA v2. To learn better cross-modality representations and vision-language alignment, they are trained with large-scale pre-training data and generally have huge model capacity. Among them, lxmert (Tan and Bansal, 2019) is the most widely used VLP backbone in the VQA field (e.g., in some data-augmentation debiasing methods (Gokhale et al., 2020; Si et al., 2021; Wen et al., 2021) and in MuKEA (Ding et al., 2022), a method for open-domain VQA (Marino et al., 2019)). In this paper, we therefore mainly use lxmert as the backbone model and extend several debiasing methods to it for an in-depth study of compressing and debiasing. For completeness, we also conduct experiments on the popular VLP visualBERT (Li et al., 2019).

A.3 Model Compression and Robustness
Model compression techniques for Transformer-based pre-trained models are well developed (mainly around BERT), including pruning (Gale et al., 2019; Gordon et al., 2020; Michel et al., 2019), knowledge distillation (Jiao et al., 2020; Sanh et al., 2019; Sun et al., 2019), parameter sharing (Lan et al., 2020) and quantization (Zafrir et al., 2019; Zhang et al., 2020). Inspired by the lottery ticket hypothesis (Frankle and Carbin, 2019), many recent studies show that BERT can be pruned to a sparse subnetwork after (Gale et al., 2019) and before fine-tuning (Chen et al., 2020b; Liang et al., 2021a; Liu et al., 2022; Prasanna et al., 2020) without degrading performance. On this basis, we extend the pruning paradigm to the fine-tuned lxmert for the OOD scenario in VQA, incorporating debiasing methods in both fine-tuning and pruning. In the NLP and CV fields, some recent efforts have also been made to study model compression together with robustness to adversarial attacks (Fu et al., 2021; Gui et al., 2019; Sehwag et al., 2020; Xu et al., 2021; Ye et al., 2019) and spurious correlations (Du et al., 2021; Xu et al., 2021), which are more common than worst-case adversarial attacks. The dataset-bias problem is a typical symptom of spurious correlations and poses a challenge to VQA models. We are the first to thoroughly investigate sparsity and OOD robustness for VLPs in VQA.

B.1 lxmert Architecture and Subnetworks
For lxmert, the embedding layer and the visual fc layer map the language-modality input (token sequences obtained by the WordPiece tokenizer) and the vision-modality input (36 object features obtained by Faster R-CNN (Ren et al., 2015)) into the same-dimensional space. The pooler layer connects the Transformer top layer and the classifier. The Transformer layers involve three encoders: the language encoder (L_enc), the object relationship encoder (R_enc) and the cross-modality encoder (C_enc), and are composed of attention modules and feed-forward networks (FFNs).
The attention modules have four kinds of weight matrices: the query, key and value matrices W_Q, W_K, W_V ∈ R^{d_model×d_model} and the output matrix W_O ∈ R^{d_model×d_model}. Each Transformer layer of the language encoder and the object relationship encoder has a multi-head self-attention module and an FFN. Each Transformer layer of the cross-modality encoder has a language self-attention module, a visual self-attention module and a multi-head cross-attention module; only the language and visual self-attention modules are followed by FFNs. For lxmert, we focus on the embedding layer, visual fc layer, pooler layer and Transformer layers, whose parameters are pre-trained, while the classifier is excluded. The language encoder, object relationship encoder and cross-modality encoder have T, I and X Transformer layers, respectively. The parameters to be pruned are:

θ_pr = {W_emb, W_vis-fc, W_plr} ∪ θ_Lenc ∪ θ_Renc ∪ θ_Cenc,

where W_emb, W_vis-fc and W_plr are the weights of the embedding layer, visual fc layer and pooler layer, and θ_Lenc ∪ θ_Renc ∪ θ_Cenc are the parameters of the Transformer layers:

θ_Lenc = ∪_{l=1}^{T} (θ_att^l ∪ θ_ffn^l), θ_Renc = ∪_{l=1}^{I} (θ_att^l ∪ θ_ffn^l), θ_Cenc = ∪_{l=1}^{X} (θ_CL^l ∪ θ_CR^l ∪ θ_CX^l ∪ θ_ffn-CL^l ∪ θ_ffn-CR^l), (12)

where each attention module θ_att contains W_Q, W_K, W_V and W_O, each FFN θ_ffn contains an up- and a down-projection matrix, and CL, CR and CX denote the language self-attention, visual self-attention and cross-attention modules of the cross-modality encoder, respectively.
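As a rough sanity check on these sizes, the Transformer-layer parameters can be counted from the matrix shapes above, assuming d_model = 768, an FFN inner dimension of 3072, and the layer counts T = 9, I = 5, X = 5 reported for lxmert (biases and LayerNorms omitted):

```python
D, F = 768, 3072           # hidden size and FFN inner size (assumed)
ATTN = 4 * D * D           # W_Q, W_K, W_V plus the output matrix W_O
FFN = D * F + F * D        # up- and down-projection

# language / object-relationship layer: one self-attention + one FFN
single_layer = ATTN + FFN
# cross-modality layer: CL, CR, CX attention modules + two FFNs
cross_layer = 3 * ATTN + 2 * FFN

T, I, X = 9, 5, 5
total = T * single_layer + I * single_layer + X * cross_layer
print(round(total / 1e6, 1), "M Transformer parameters")  # ~181.7M
```

Adding the embedding, visual fc and pooler weights brings the count close to the 197.7M prunable parameters reported in App. B.4.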

B.2 visualBERT Architecture and Subnetworks
Similar to lxmert, visualBERT is composed of an embedding layer, a visual projection layer, a pooler layer and a stack of Transformer layers. Differently, visualBERT's Transformer layers involve only a single encoder (V_enc). The parameters of visualBERT to be pruned are:

θ = {W_emb, W_plr} ∪ θ_Venc

where W_emb and W_plr are the weights of the embedding layer and pooler layer, and θ_Venc are the parameters of the Transformer layers:

θ_Venc = { W_v^Q, W_v^K, W_v^V, W_v^O, W_v^FFN }, v = 1, ..., V

where V = 12.

B.3 LMH details
LMH takes a step further based on Product of Experts (PoE) (Hinton, 2002), which simply combines the predicted distributions of the main model and the biased model as follows:

p_poe = softmax(log(p_m) + log(p_b))

where p_m and p_b are the predicted distributions of the main model and the biased model, respectively; p_b indicates the bias degree of the sample. In this way, when a sample is heavily biased, i.e., p_b is large, the main model will not output a large p_m for it during training. Following Clark et al. (2019), we directly use the answers' frequency under each question type as p_b.
To selectively adjust the main model's behavior, LMH adds a learned function g to explicitly determine how much to trust the learned biases:

p̂ = softmax(log(p_m) + g(h) log(p_b)), g(h) = softplus(w · h)

where h is the cross-modality representation from the last hidden layer of lxmert and w is trainable. To prevent p_b from being ignored, LMH also adds an entropy penalty item R, and the final loss is computed as:

L = CE(p̂) + R, R = w_H · H(softmax(g(h) log(p_b)))

where CE is the cross-entropy, H is the entropy and w_H is a hyper-parameter.
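A minimal sketch of this loss, following the learned-mixin+H formulation of Clark et al. (2019). All function and variable names are ours, numpy stands in for the actual training framework, and the default w_H is an assumed value:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def softplus(x):
    return np.log1p(np.exp(x))

def lmh_loss(log_pm, log_pb, h, w, answer, w_H=0.36):
    """Learned-mixin+H loss (sketch).
    log_pm: main-model log-probabilities; log_pb: bias log-probabilities;
    h: last-layer cross-modality representation; w: trainable vector;
    answer: gold-answer index; w_H: entropy-penalty coefficient (assumed)."""
    g = softplus(np.dot(w, h))            # how much to trust the bias
    p_ens = softmax(log_pm + g * log_pb)  # ensemble prediction
    ce = -np.log(p_ens[answer])           # cross-entropy of the ensemble
    scaled_bias = softmax(g * log_pb)
    # Entropy penalty R keeps g from collapsing to zero (ignoring p_b)
    R = w_H * -(scaled_bias * np.log(scaled_bias + 1e-12)).sum()
    return ce + R
```

At test time only the main model p_m is used; the bias branch and g exist only to reshape the training signal.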

B.4 Model and Implementation Details
Lxmert has about 202M parameters, of which 197.7M are involved in the pruning process (the remaining 4.5M belong to the classifier). The three modules from different modalities, namely the language module, the visual module and the cross-modality module, contain 83.1M, 35.3M and 78.8M parameters, respectively. We train the models for 20 epochs with a batch size of 128 on two Tesla-V100-32G GPUs, or 256 on an A100-80GB GPU. The AdamW (Loshchilov and Hutter, 2017) optimizer is adopted with a learning rate of 5e-5. Our code is based on the huggingface transformers library (Wolf et al., 2020). We adopt the coco-pre version of visualBERT, which is pre-trained on the COCO (Chen et al., 2015) dataset.
C More Experiments on VQA-CP v2

C.1 Performance of Subnetworks on Three Types of Questions

Subnetworks from BCE Fine-tuned lxmert.
For the three types of questions, as shown in the right three plots of Fig. 12 (upper), we find that: 1) The performance on "Num" questions is sensitive to the varying sparsity levels, while that on "Y/N" questions is relatively stable in general, except at 90% sparsity. Specially, as sparsity increases, the performance of "mask train(lmh)" and "OMP + lmh ft" on "Num" questions counter-intuitively improves by a large margin. This shows that language biases for the "Num" questions exist in a large proportion of the parameters of biased lxmert. 2) For the "Other" questions, debiasing methods bring little gain to the performance of subnetworks. For example, the performance of "mask train(lmh)" is similar to that of "mask train(bce)". This indicates that the language biases for "Other" questions are minor in the training set. Therefore, "Other" questions require more reasoning than debiasing. 3) There is a sharp decline in all the subnetworks' performance on "Other" questions from 70% to 90% sparsity. We conjecture that this is because reducing the model's capacity too drastically hurts the reasoning ability that is necessary to answer the "Other" questions correctly.
The right three plots of Fig. 12 (lower) show the performance of LMH fine-tuned lxmert subnetworks on different types of questions. For the "Num" questions, when compressing LMH fine-tuned lxmert (the grey and maroon lines), the performance of subnetworks no longer rises with sparsity growth. This demonstrates that language biases for the "Num" questions exist in a much smaller proportion of the parameters of debiased lxmert than of biased lxmert. For "Other" questions, "lxmert(bce) + mask train(lmh)" is consistently superior to "lxmert(lmh) + mask train(lmh)", which demonstrates that further debiasing the debiased full lxmert in the pruning process sacrifices the reasoning ability.

C.3 A Close Look at the Performance of Subnetworks at 90% Sparsity

From Fig. 14, we make two abnormal observations at the extremely high sparsity, i.e., 90%: 1) Pruning with "OMP + lmh ft" (pink and grey lines) is better than pruning with "mask train(lmh)" (cyan and brown lines). 2) Starting from "lxmert(bce)" (pink and cyan lines) is better than starting from "lxmert(lmh)" (grey and brown lines). The two observations at 90% sparsity are contrary to those at other sparsity levels. For the first observation, we conjecture that mask training (which involves binarization and gradient estimation) is more difficult to optimize at 90% sparsity than further fine-tuning of the OMP subnetworks. The second observation can be explained as follows: further debiasing the debiased full lxmert in the pruning process slightly sacrifices the performance on "Other" questions, which require more reasoning ability than debiasing ability (as shown in the rightmost two plots of Fig. 12). Therefore, at the extremely high sparsity, when the benefits of debiasing on "Y/N" and "Num" questions are small, the performance penalty on "Other" questions results in a drop in "Overall" accuracy. Nevertheless, the gaps between "lxmert(lmh) + mask train(lmh)" and the other two pipelines are small at 90% sparsity.

C.4 Sparsity Configurations for the Three Modality-specific Modules
For the overall target sparsities of 50% and 70%, we adopt the following procedure to search the comfortable zone for the modality-specific sparsity: First, we traverse [10%, 30%, 50%, 70%, 90%] (i.e., a step size of 20%) to assign the modality-specific sparsity for any two modules, and compute the modality-specific sparsity for the remaining one 8 according to eq. 10 in the main paper. From the experimental results of these sparsity configurations, we can determine the approximate range in which the pruned subnetworks perform better.
Second, we use the same method to traverse the reduced range with a smaller step size of 5%.In this way, we can determine the most comfortable zone for the modality-specific sparsity.
Similarly, when the overall target sparsity is 90%, we directly traverse 80% ∼ 98% with a step size of 2% to search the most comfortable zone of the modality-specific sparsity.
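The search procedure above can be sketched as follows, assuming eq. 10 constrains the parameter-weighted average of the three modality-specific sparsities to equal the overall target. The module parameter counts are taken from App. B.4; the function names are ours:

```python
# Parameter counts (in millions) of the three modality-specific modules (App. B.4)
N = {"lang": 83.1, "vis": 35.3, "cross": 78.8}
TOTAL = sum(N.values())

def remaining_sparsity(target, fixed):
    """Solve the overall-sparsity constraint (eq. 10 in the main paper,
    assumed to be a parameter-weighted average) for the one unfixed module.
    target: overall sparsity; fixed: dict mapping two modules to sparsities."""
    (free,) = set(N) - set(fixed)
    pruned_fixed = sum(s * N[mod] for mod, s in fixed.items())
    return free, (target * TOTAL - pruned_fixed) / N[free]

def grid_search(target, grid):
    """Enumerate sparsity assignments for two modules on `grid`, derive the
    third, and keep only feasible configurations (sparsity in [0, 1])."""
    configs = []
    for s_lang in grid:
        for s_vis in grid:
            free, s_free = remaining_sparsity(target, {"lang": s_lang, "vis": s_vis})
            if 0.0 <= s_free <= 1.0:
                configs.append({"lang": s_lang, "vis": s_vis, free: s_free})
    return configs
```

Running `grid_search` with the coarse 20%-step grid, then re-running it over the reduced range with a 5% step, mirrors the two-stage traversal described above.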
D More Experiments on VQA-VS

D.1 Performance on Varying OOD Test Sets of VQA-VS

The Effect of Compression without Debiasing. For simplicity, we categorize the nine OOD test sets into three categories of different modalities, i.e., language-based (OOD-lang), visual-based (OOD-vis) and cross-modality (OOD-crsM) ones. We report the average accuracy of each category, as well as the IID accuracy and the average accuracy over all OOD test sets (OOD-mean), in Fig. 15. The upper part of Fig. 15 shows the performance of subnetworks compressed without any debiasing method. It can be seen that: 1) All subnetworks obtained by pruning the three modules underperform "full model(bce)" on the ID test set. This is because the ID performance relies on memorization ability, which is positively related to the parameter quantity. 2) The subnetworks obtained by pruning the language module consistently outperform the full model on the OOD-mean, OOD-lang and OOD-crsM test sets, which are related to the language bias. This indicates that the language module of lxmert is slightly over-parameterized. 3) In contrast, pruning the other modules has a negative impact on OOD performance. In particular, pruning the visual module also results in a sharp OOD-vis accuracy drop, indicating that the visual module of lxmert is not suitable for compression.

The Effect of Compression with Debiasing. The lower part of Fig. 15 shows the VQA-VS performance of subnetworks compressed with the debiasing method.

Figure 4: Results of lxmert subnetworks obtained from different training and compressing pipelines on VQA-CP v2. "ft" means further fine-tuning the subnetworks in Stage3.

Figure 7: Results of visualBERT subnetworks that adopt different debiasing methods in Stage1 and Stage2 on VQA-CP v2.

Figure 8: Results of subnetworks pruned by different sparsity configurations on VQA-CP v2 using "lxmert(lmh) + mask train(lmh)". Red and blue lines denote the coordinates of the data point with uniform sparsity across three modules and the data point performing the best (the specific configuration is shown below each plot), respectively. The overall sparsities are shown in the titles.

Figure 11: Comparison of "lxmert(lmh) + mask train(lmh)" subnetworks with uniform and modality-specific sparsity on VQA-VS. Results on specific OOD test sets can be found in App. D.2.
Figure 14: Results of subnetworks obtained by pruning with the debiasing method LMH on VQA-CP v2.