Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering

Active learning promises to alleviate the massive data needs of supervised machine learning: it has successfully improved sample efficiency by an order of magnitude on traditional tasks like topic classification and object recognition. However, we uncover a striking contrast to this promise: across 5 models and 4 datasets on the task of visual question answering, a wide variety of active learning approaches fail to outperform random selection. To understand this discrepancy, we profile 8 active learning methods on a per-example basis, and identify the problem as collective outliers – groups of examples that active learning methods prefer to acquire but models fail to learn (e.g., questions that ask about text in images or require external knowledge). Through systematic ablation experiments and qualitative visualizations, we verify that collective outliers are a general phenomenon responsible for degrading pool-based active learning. Notably, we show that active learning sample efficiency increases significantly as the number of collective outliers in the active learning pool decreases. We conclude with a discussion and prescriptive recommendations for mitigating the effects of these outliers in future work.


Introduction
Today, language-equipped vision systems such as VizWiz, TapTapSee, BeMyEyes, and CamFind are actively being deployed across a broad spectrum of users. As underlying methods improve, these systems will be expected to operate over diverse visual environments and understand myriad language inputs (Bigham et al., 2010;Tellex et al., 2011;Mei et al., 2016;Anderson et al., 2018b;Park et al., 2019). Visual Question Answering (VQA), the task of answering questions about visual inputs, is a popular benchmark used to evaluate progress towards such open-ended systems (Agrawal et al., 2015;Krishna et al., 2017;Gordon et al., 2018;Hudson and Manning, 2019). Unfortunately, today's VQA models are data hungry: their performance scales monotonically with more training data (Lu et al., 2016;Lin and Parikh, 2017), motivating the need for data acquisition mechanisms such as active learning, which maximize performance while minimizing expensive data labeling.

Figure 1: We systematically evaluate active learning on VQA datasets and trace its inability to perform better than random sampling to the presence of collective outliers. Active learning methods prefer to acquire these outliers, which are hard and often impossible for models to learn. We show that Dataset Maps, like the one shown here, can heuristically identify these collective outliers as examples assigned low model confidence and prediction variability during training.
While active learning is often key to effective data acquisition when such labeled data is difficult to obtain (Lewis and Catlett, 1994;Tong and Koller, 2001;Culotta and McCallum, 2005;Settles, 2009), we find that 8 modern active learning methods (Gal et al., 2017;Siddhant and Lipton, 2018;Lowell et al., 2019) show little to no improvement in sample efficiency across 5 models on 4 VQA datasets -indeed, in some cases performing worse than randomly selecting data to label. This finding is in stark contrast to the successful application of active learning methods on a variety of traditional tasks, such as topic classification (Siddhant and Lipton, 2018;Lowell et al., 2019), object recognition (Deng et al., 2018), digit classification (Gal et al., 2017), and named entity recognition (Shen et al., 2017). Our negative results hold even when accounting for common active learning ailments: cold starts, correlated sampling, and uncalibrated uncertainty. We mitigate the cold start challenge of needing a representative initial dataset by varying the size of the seed set in our experiments. We account for sampling correlated data within a given batch by including Core-Set selection (Sener and Savarese, 2018) in the set of active learning methods we evaluate. Finally, we use deep Bayesian active learning to calibrate model uncertainty to high-dimensional data (Houlsby et al., 2011;Gal and Ghahramani, 2016;Gal et al., 2017).
After concluding that negative results are consistent across all experimental conditions, we investigate active learning's ineffectiveness on VQA as a data problem and identify the existence of collective outliers (Han and Kamber, 2000) as the source of the problem. Leveraging recent advances in model interpretability, we build Dataset Maps (Swayamdipta et al., 2020), which distinguish between collective outliers and useful data that improve validation set performance (see Figure 1). While global outliers deviate from the rest of the data and are often a consequence of labeling error, collective outliers cluster together; they may not individually be identifiable as outliers but collectively deviate from other examples in the dataset. For instance, VQA-2 (Goyal et al., 2017) is riddled with collections of hard questions that require external knowledge to answer (e.g., "What is the symbol on the hood often associated with?") or that ask the model to read text in the images (e.g., "What is the word on the wall?"). Similarly, GQA (Hudson and Manning, 2019) asks underspecified questions (e.g., "what is the person wearing?" which can have multiple correct answers). Collective outliers are not specific to VQA, but can similarly be found in many open-ended tasks, including visual navigation (Anderson et al., 2018b) (e.g., "Go to the grandfather clock" requires identifying rare grandfather clocks), and open-domain question answering (Kwiatkowski et al., 2019), amongst others.
Using Dataset Maps, we profile active learning methods and show that they prefer acquiring collective outliers that models are unable to learn, explaining their poor improvements in sample efficiency relative to random sampling. Building on this, we use these maps to perform ablations where we identify and remove outliers iteratively from the active learning pool, observing correlated improvements in sample efficiency. This allows us to conclude that collective outliers are, indeed, responsible for the ineffectiveness of active learning for VQA. We end with prescriptive suggestions for future work in building active learning methods robust to these types of outliers.

Related Work
Our work tests the utility of multiple recent active learning methods on the open-ended understanding task of VQA. We draw on the dataset analysis literature to identify collective outliers as the bottleneck hindering active learning methods in this setting.
Interpreting and Analyzing Datasets. Given the prevalence of large datasets in modern machine learning, it is critical to assess dataset properties to remove redundancies (Gururangan et al., 2018;Li and Vasconcelos, 2019) or biases (Torralba and Efros, 2011;Khosla et al., 2012;Bolukbasi et al., 2016), both of which negatively impact sample efficiency. Prior work has used training dynamics to find examples which are frequently forgotten (Krymolowski, 2002;Toneva et al., 2019) versus those that are easy to learn (Bras et al., 2020). This work suggests using two model-specific measures - confidence and prediction variance - as indicators of a training example's "learnability" (Chang et al., 2017;Swayamdipta et al., 2020). Dataset Maps (Swayamdipta et al., 2020), a recently introduced framework, uses these two measures to profile datasets and surface learnable examples. Unlike prior datasets analyzed with Dataset Maps, which contain only a small number of global outliers as hard examples, we discover that VQA datasets contain copious amounts of collective outliers, which are difficult or even impossible for models to learn.

Active Learning Experimental Setup
We adopt the standard pool-based active learning setup from prior work (Lewis and Gale, 1994;Settles, 2009;Gal et al., 2017;Lin and Parikh, 2017), consisting of a model M, an initial seed set of labeled examples (x_i, y_i) ∈ D_seed used to initialize M, an unlabeled pool of data D_pool, and an acquisition function A(x, M). We run active learning over a series of acquisition iterations T, where at each iteration we acquire a batch of B new examples x ∈ D_pool to label, each selected as x* = argmax_{x ∈ D_pool} A(x, M).
Acquiring an example often refers to using an oracle or human expert to annotate a new example with a correct label. We follow prior work to simulate an oracle using existing datasets, forming D_seed from a fixed percentage of the full dataset and using the remainder as D_pool (Gal et al., 2017;Lin and Parikh, 2017;Siddhant and Lipton, 2018). We re-train M after each acquisition iteration.
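For concreteness, a minimal sketch of this simulated loop is below. The training and evaluation callables (`train_fn`, `eval_fn`), the model constructor, and the acquisition interface are placeholders for illustration, not our released implementation.

```python
def active_learning_loop(model_init, train_fn, eval_fn, acquire_fn,
                         seed_set, pool, batch_size, iterations):
    """Simulated pool-based active learning: labels already present in `pool` stand in for an oracle."""
    labeled, unlabeled = list(seed_set), list(pool)
    model = train_fn(model_init(), labeled)                 # initialize M on the seed set D_seed
    history = [eval_fn(model)]
    for _ in range(iterations):
        # Score every pooled example with the acquisition function A(x, M);
        # the random baseline simply shuffles the pool instead of scoring it.
        scores = [acquire_fn(x, model) for x, _ in unlabeled]
        ranked = sorted(range(len(unlabeled)), key=scores.__getitem__, reverse=True)
        picked = set(ranked[:batch_size])
        labeled += [unlabeled[i] for i in picked]
        unlabeled = [ex for i, ex in enumerate(unlabeled) if i not in picked]
        model = train_fn(model_init(), labeled)             # re-train after every acquisition iteration
        history.append(eval_fn(model))
    return history
```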
Prior work has noted the impact of seed set size on active learning performance (Lin and Parikh, 2017;Misra et al., 2018;Jedoui et al., 2019). We run multiple active learning evaluations with varying seed set sizes (ranging from 5% to 50% of the full pool size). We fix the size of each acquisition batch B at a constant 10% of the overall pool size.

Models
Visual Question Answering (VQA) requires reasoning over two modalities: images and text. Most models use feature "backbones" (e.g., features from object recognition models pretrained on ImageNet, and pretrained word vectors for text). We evaluate with a representative sample of existing VQA models, including the following:

LogReg is a logistic regression model that uses either ResNet-101 or Faster R-CNN image features with mean-pooled GloVe question embeddings (Pennington et al., 2014). Although these models are not as performant as the subsequent models, logistic regression has been effective on VQA (Suhr et al., 2019), and is pervasive in the active learning literature (Schein and Ungar, 2007;Yang and Loog, 2018;Mussmann and Liang, 2018).
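As a rough sketch (not our exact released code), the logistic regression baseline amounts to a single linear layer over concatenated frozen image features and mean-pooled GloVe embeddings; the feature dimensions below are illustrative defaults.

```python
import torch
import torch.nn as nn

class VQALogReg(nn.Module):
    """Multi-class logistic regression over frozen backbone features (a sketch)."""
    def __init__(self, img_dim=2048, glove_dim=300, n_answers=3130):
        super().__init__()
        self.classifier = nn.Linear(img_dim + glove_dim, n_answers)

    def forward(self, img_feats, question_vectors):
        # img_feats: [B, img_dim] pooled ResNet-101 (or Faster R-CNN) features
        # question_vectors: [B, T, glove_dim] per-token GloVe embeddings
        question = question_vectors.mean(dim=1)              # mean-pool over tokens
        return self.classifier(torch.cat([img_feats, question], dim=-1))
```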
LSTM-CNN is a standard model introduced with VQA-1 (Agrawal et al., 2015). We use more performant ResNet-101 features instead of the original VGGNet features as our visual backbone.
BUTD (Bottom-Up Top-Down Attention) uses object-based features in tandem with attention over objects (Anderson et al., 2018a). BUTD won the 2017 VQA Challenge (Teney et al., 2018), and has been a consistent baseline for recent work in VQA.

Acquisition Functions
Several active learning methods have been developed to account for different aspects of the machine learning training pipeline: while some acquire examples with high aleatoric uncertainty (Settles, 2009) (having to do with the natural uncertainty in the data) or epistemic uncertainty (Gal et al., 2017) (having to do with the uncertainty in the modeling/learning process), others attempt to acquire examples that reflect the distribution of data in the pool (Sener and Savarese, 2018). We sample a diverse set of these methods:

Random Sampling serves as our baseline passive approach for acquiring examples.
Least Confidence acquires examples with lowest model prediction probability (Settles, 2009).
Entropy acquires examples with the highest entropy in the model's output (Settles, 2009).
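Both of these strategies are simple functions of the model's softmax output; a sketch over a matrix of predicted probabilities (shapes and names are assumptions):

```python
import torch

def least_confidence(probs):
    # probs: [N, C] softmax outputs over the answer vocabulary;
    # higher scores (lower top-class probability) are acquired first.
    return 1.0 - probs.max(dim=-1).values

def predictive_entropy(probs, eps=1e-12):
    # Shannon entropy of the predictive distribution; higher entropy is acquired first.
    return -(probs * (probs + eps).log()).sum(dim=-1)
```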

MC-Dropout Entropy (Monte-Carlo Dropout with Entropy acquisition) acquires examples with high entropy in the model's output averaged over multiple passes through a neural network with different dropout masks (Gal and Ghahramani, 2016). This process is a consequence of a theoretical casting of dropout as approximate Bayesian inference in deep Gaussian processes.
BALD (Bayesian Active Learning by Disagreement) builds upon Monte-Carlo Dropout by proposing a decision theoretic objective; it acquires examples that maximise the decrease in expected posterior entropy (Houlsby et al., 2011;Gal et al., 2017;Siddhant and Lipton, 2018) -capturing "disagreement" across different dropout masks.
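Both Bayesian strategies can be written as functions of a stack of softmax outputs from k dropout-perturbed forward passes (Appendix B.2 describes how we form this stack); a sketch under that assumption:

```python
import torch

def mc_dropout_entropy(mc_probs, eps=1e-12):
    # mc_probs: [k, N, C] softmax outputs from k forward passes with different dropout masks.
    mean_probs = mc_probs.mean(dim=0)                                    # [N, C]
    return -(mean_probs * (mean_probs + eps).log()).sum(dim=-1)          # entropy of the mean prediction

def bald(mc_probs, eps=1e-12):
    # BALD = H[mean prediction] - E[per-pass entropy]: the mutual information between
    # predictions and model parameters, i.e., "disagreement" across dropout masks.
    mean_entropy_per_pass = -(mc_probs * (mc_probs + eps).log()).sum(dim=-1).mean(dim=0)
    return mc_dropout_entropy(mc_probs, eps) - mean_entropy_per_pass
```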

Core-Set Selection samples examples that capture the diversity of the data pool (Sener and Savarese, 2018;Coleman et al., 2020). It acquires examples to minimize the distance from each example in the unlabeled pool to its closest labeled example. Since Core-Set selection operates over a representation space (and not an output distribution, like prior strategies) and VQA models operate over two modalities, we employ three Core-Set variants: Core-Set (Language) and Core-Set (Vision) operate over their respective representation spaces, while Core-Set (Fused) operates over the "fused" vision and language representation space.
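A sketch of the greedy k-center acquisition underlying all three variants; `pool_reps` and `labeled_reps` are assumed to be representation matrices from whichever encoder (vision, language, or fused) the variant uses. The exact and amortized versions we actually run are described in Appendix B.2.

```python
import torch

def coreset_acquire(pool_reps, labeled_reps, batch_size):
    # pool_reps: [N, d] representations of unlabeled examples;
    # labeled_reps: [M, d] representations of already-labeled examples.
    # Greedy 2-approximation to k-centers: repeatedly take the pool point
    # farthest from its nearest labeled/acquired point.
    min_dists = torch.cdist(pool_reps, labeled_reps).min(dim=1).values   # [N]
    picked = []
    for _ in range(batch_size):
        idx = int(min_dists.argmax())
        picked.append(idx)
        # Fold the newly acquired point in as a center and update nearest distances.
        new_dists = torch.cdist(pool_reps, pool_reps[idx:idx + 1]).squeeze(1)
        min_dists = torch.minimum(min_dists, new_dists)
        min_dists[idx] = -1.0   # never re-pick the same example
    return picked
```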

Experimental Results
We evaluate the 8 active learning strategies across the 5 models described in the previous section. Figures 2-5 show a representative sample of active learning results across datasets. Due to space constraints, we only visualize 4 active learning strategies - Least-Confidence, BALD, Core-Set (Fused), and the Random Baseline - using 3 models (LSTM-CNN, BUTD, LXMERT). Results and trends are consistent across the different acquisition functions, models, and seed set sizes (see the appendix for results with other models, acquisition functions, and seed set sizes). We now describe the datasets we evaluate against and the corresponding results.

Figure 2: Results for varied active learning methods on VQA-Sports, a simplified VQA dataset. Strategies perform on par with or worse than the random baseline when using 10% of the full dataset as the seed set.

Figure 3: Results for the full VQA-2 dataset, also using 10% of the full dataset as a seed set. As in the plot above, all active learning methods perform similarly to the random baseline.

Figure 4: Results on VQA-2 using 50% of the dataset as a seed set. While methods fare relatively better with a larger seed set - confirming results from Lin and Parikh (2017) - no method outperforms random.

Figure 5: Results on GQA using 10% of the dataset for the seed set. Even with different question structures, the above trends hold, with strategies performing worse than or equivalent to random.

Figure 6 (Acquisitions by Difficulty): We visualize the difference in acquisition preferences between random and active learning acquisitions (least confidence and BALD) across multiple iterations. Active learning methods prefer to sample impossible examples which models are unable to learn, hurting sample efficiency relative to the random baseline.

Simplified VQA Datasets
One source of complexity in VQA is the size of the output space and the sheer number of examples (Agrawal et al., 2015;Goyal et al., 2017); VQA-2 has 400k training examples and in excess of 3k possible answers (see Table 1). However, prior work in active learning focuses on smaller datasets like the 10-class MNIST dataset (Gal et al., 2017), binary classification (Siddhant and Lipton, 2018), or small-cardinality (≤ 20 classes) text categorization (Lowell et al., 2019). To ensure our results and conclusions are not due to the size of the output space, we build two meaningful but narrow-domain VQA datasets from subsets of VQA-2. These simplified datasets reduce the complexity of the underlying learning problem and provide a fair comparison to the existing active learning literature.

VQA-Sports.
We generate VQA-Sports by compiling a list of 20 popular sports (e.g., soccer, football, tennis) appearing as answers in VQA-2, and restricting the set of questions to those with answers in this list. We picked the sports categories by ranking GloVe vector similarity between the word "sports" and answers in VQA-2, and selecting the 20 most commonly occurring answers.
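A sketch of this selection heuristic, assuming GloVe vectors are loaded into a dictionary of NumPy arrays and answer frequencies into a Counter; the similarity cutoff is an illustrative placeholder, not a value from our pipeline.

```python
import numpy as np

def related_answer_categories(anchor_word, answer_counts, glove, k=20, sim_threshold=0.3):
    # Rank candidate answers by cosine similarity to the anchor word (e.g., "sports"),
    # then keep the k most frequently occurring answers among the similar ones.
    anchor = glove[anchor_word]

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    similar = [ans for ans in answer_counts
               if ans in glove and cosine(glove[ans], anchor) >= sim_threshold]
    return sorted(similar, key=answer_counts.get, reverse=True)[:k]
```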
VQA-Food. We generate the VQA-Food dataset similarly, compiling a list of the 20 most commonly occurring food categories by GloVe vector similarity to the word "food."

Results. Figure 2 presents results for VQA-Sports, with an initial seed set restricted to 10% of the total pool (500 examples). The appendix reports similar results on VQA-Food. For LSTM-CNN, Least-Confidence appears to be slightly more sample efficient, while all other strategies perform on par with or worse than random. For BUTD, all methods are on par with random; for LXMERT, they perform worse than random. Generally on VQA-Sports, active learning performance varies but fails to outperform random acquisition.

VQA-2
VQA-2 is the canonical dataset for evaluating VQA models (Goyal et al., 2017). In keeping with prior work (Anderson et al., 2018a; Tan and Bansal, 2019), we filter the training set to only include answers that appear at least 9 times, resulting in 3130 unique answers. Unlike traditional VQA-2 evaluation, which treats the task as a multi-label binary classification problem, we follow prior active learning work on VQA (Lin and Parikh, 2017), which formulates it as a multi-class classification problem, enabling the use of acquisition functions such as uncertainty sampling and BALD.
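The answer-vocabulary filtering step is straightforward; a small sketch (the ≥ 9 threshold follows the prior work cited above, and the function name is ours):

```python
from collections import Counter

def build_answer_vocab(train_answers, min_count=9):
    # train_answers: list of ground-truth answer strings from the VQA-2 training set.
    counts = Counter(train_answers)
    kept = sorted(ans for ans, c in counts.items() if c >= min_count)
    return {ans: idx for idx, ans in enumerate(kept)}   # ~3130 answer classes for VQA-2
```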

GQA

GQA (Hudson and Manning, 2019) contains compositional questions whose structure differs markedly from VQA-2 (e.g., multi-hop questions such as "... the right of?"). We use the standard GQA training set of 943k questions, 900k of which we use for the active learning pool.

Results. Figure 5 shows results on GQA using a seed set of 10% of the full pool (90k examples). Despite its notable differences in question structure from VQA-2, active learning still performs on par with or slightly worse than random.

Analysis via Dataset Maps
The previous section shows that active learning fails to improve over random acquisition on VQA across models and datasets. A simple question remains - why? One hypothesis is that sample inefficiency stems from the data itself: there is only a 2% gain in validation accuracy when training on half versus the whole dataset. Working from this, we characterize the underlying datasets using Dataset Maps (Swayamdipta et al., 2020) and discover that active learning methods prefer sampling "hard-to-learn" examples, leading to poor performance.

Collective Outliers
This leaves two questions: 1) can we characterize these "hard" examples, and 2) are these examples responsible for the ineffectiveness of active learning on VQA? We first identify hard-to-learn examples as collective outliers and explain why active learning methods prefer to acquire them. Next, we perform ablation experiments, removing these outliers from the active learning pool iteratively, and demonstrate a corresponding boost in sample efficiency relative to random acquisition.

Examining the hard-to-learn regions of the Dataset Maps (Figure 7), in VQA-2 we identify clusters of hard-to-learn examples that require optical character recognition (OCR) for reasoning about text (e.g., "What is the first word on the black car?"); another cluster requires external knowledge to answer ("What is the symbol on the hood often associated with?"). In GQA, we identify different clusters of collective outliers; one cluster stems from innate underspecification (e.g., "what is on the shelf?" with multiple objects present on the shelf); another cluster requires multiple reasoning hops that are difficult for current models (e.g., "What is the vehicle that is driving down the road the box is on the side of?"). We sample 100 random "hard-to-learn" examples from both VQA-2 and GQA and find that 100% of the examples belong to one of the two aforementioned collectives. Since hard-to-learn examples constitute 25-30% of the data pool, active learning methods cannot avoid them. Uncertainty-based methods (e.g., Least-Confidence, Entropy, Monte-Carlo Dropout) identify them as valid acquisition targets because models lack the capacity to correctly answer these examples, assigning them low confidence and high uncertainty. Disagreement-based methods (e.g., BALD) behave similarly; model confidence on these examples is generally low but varies widely across dropout masks (the lower middle/lower right of the Dataset Maps). Finally, diversity methods (e.g., Core-Set selection) identify these examples as different enough from the existing pool to warrant acquisition, but because models fail to learn meaningful representations for them, these methods continue to pick similar examples, fueling a vicious cycle.
Ablating Outliers. To verify that collective outliers are responsible for the degradation of active learning performance, we re-run our experiments using active learning pools with varying numbers of outliers removed. To remove these outliers, we rank all examples in the data pool by the product of their model confidence and prediction variability (the x- and y-axis values of the Dataset Maps), systematically remove the examples with the lowest product values, and observe how active learning performance changes (see Figure 8).
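Concretely, given the probability the model assigns to the gold answer after each training epoch, both Dataset Map coordinates and the ablation ranking reduce to a few lines; a sketch (array shapes and names are assumptions):

```python
import numpy as np

def dataset_map_coordinates(gold_probs):
    # gold_probs: [epochs, N] probability assigned to the gold answer after each epoch.
    confidence = gold_probs.mean(axis=0)     # mean gold-label probability over training
    variability = gold_probs.std(axis=0)     # spread of that probability over training
    return confidence, variability

def outlier_removal_order(gold_probs):
    # Examples with a low confidence * variability product are the likely collective
    # outliers; our ablations remove the lowest-ranked examples from the pool first.
    confidence, variability = dataset_map_coordinates(gold_probs)
    return np.argsort(confidence * variability)
```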
We observe a 2-3x improvement in sample efficiency when removing 50% of the entire data pool, consisting mainly of collective outliers (Figure 8c). The improvement decreases if we remove only 25% of the full pool (Figure 8b), and degrades further if we remove only 10% (Figure 8a). This ablation demonstrates that active learning methods are more sample efficient than the random baseline when collective outliers are absent from the unlabeled pool.

Discussion and Future Work
This paper asks a simple question - why does the modern neural active learning toolkit fail when applied to complex, open-ended tasks? While we focus on VQA, collective outliers are abundant in tasks such as natural language inference (Bowman et al., 2015;Williams et al., 2018) and open-domain question answering (Kwiatkowski et al., 2019), amongst others. More insidious is their nature: collective outliers can take multiple forms, requiring external domain knowledge or "commonsense" reasoning, containing underspecification, or requiring capabilities beyond the scope of a given model (e.g., OCR ability). While our ablations remove collective outliers and demonstrate that active learning fails as these outliers take up larger portions of the dataset, removal is only an analytical tool; collective outliers are, and will continue to be, pervasive in open-ended datasets, and as such we will need to develop better tools for learning (and performing active learning) in their presence.
Selective Classification. One potential direction for future work is to develop systems that abstain when they encounter collective outliers. Historical artificial intelligence systems, such as SHRDLU (Winograd, 1972) and QUALM (Lehnert, 1977), were designed to flag input sequences that they were not designed to parse. Ideas from those methods can and should be resurrected using modern techniques; for example, recent work suggests that a simple classifier can be trained to identify out-of-domain inputs, provided a seed out-of-domain dataset (Kamath et al., 2020). Active learning methods could be augmented with such a classifier, re-calibrating acquisition uncertainty scores with its predictions. Other work learns to identify novel utterances by intelligently setting thresholds in representation space (Karamcheti et al., 2020), a powerful idea especially if combined with other representation-centric active learning methods like Core-Set Sampling (Sener and Savarese, 2018).
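As a purely hypothetical sketch of this augmentation (we do not implement it), an acquisition score could be down-weighted by an out-of-domain classifier's estimate that an example is answerable; `ood_classifier.p_in_domain` is an assumed interface, not an existing API.

```python
def calibrated_acquisition(example, model, ood_classifier, acquire_fn):
    # acquire_fn: any standard acquisition score (entropy, BALD, ...).
    # p_in_domain: assumed to return P(example is in-domain / answerable), e.g., from a
    # classifier trained on a small seed set of known outliers (cf. Kamath et al., 2020).
    return ood_classifier.p_in_domain(example) * acquire_fn(example, model)
```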

Active Learning with Global Reasoning. Another direction for future work is to leverage Dataset Maps to perform more global, holistic reasoning over datasets and intelligently identify promising examples - in a sense, baking part of the analysis done in this work directly into the active learning algorithms. A possible instantiation of this idea would be training a discriminator to differentiate "learnable" examples (the upper half of each Dataset Map) from "unlearnable" collective outliers with low confidence and low variability. Between active learning acquisition iterations, one can generate an updated Dataset Map, thereby reflecting what models are learning as they obtain new labeled examples.
Machine learning systems deployed in real-world settings will inevitably encounter open-world datasets, ones that contain a mixture of learnable and unlearnable inputs. Our work provides a framework for studying what happens when models encounter such inputs. Overall, we hope that our experiments serve as a catalyst for future work on evaluating active learning methods with inputs drawn from open-world datasets.

Reproducibility
All code for data preprocessing, model implementation, and active learning algorithms is available at https://github.com/siddk/vqa-outliers. The repository also contains the full set of results and Dataset Maps.

The authors are fully committed to maintaining this repository, in terms of both functionality and ease of use, and will actively monitor email and GitHub Issues should problems arise.

A Overview
Due to the broad scope of our experiments and analysis, we were unable to fit all our results in the main body of the paper. Furthermore, given the limited length of the appendix, we provide only salient implementation details and other representative results here; however, we make all code, models, data, results, and active learning implementations available at this link: https://github.com/siddk/vqa-outliers.
Generally, every combination of {active learning strategy × model × seed set size × analysis/acquisition plot} is available in the public code repository, whether or not it appears in this paper.

B.1 Models & Training
Where applicable, we implement our models based on publicly available PyTorch implementations. For the LSTM-CNN model, we base our implementation off of this repository: https://github.com/Shivanshu-Gupta/Visual-Question-Answering, while for the Bottom-Up Top-Down Attention model, we use this repository: https://github.com/hengyuan-hu/bottom-up-attention-vqa, keeping the default hyperparameters the same.
Logistic Regression. When implementing Logistic Regression, we base our PyTorch implementation on the broadly used Scikit-Learn (https://scikit-learn.org) implementation, using the default parameters (including L2 weight decay). We optimize our models via stochastic gradient descent.
LXMERT. As mentioned in Section 3, the default LXMERT checkpoint and fine-tuning code made publicly available in Tan and Bansal (2019) (associated code repository: https://github.com/airsplay/lxmert) is pretrained on data from VQA-2 and GQA, leaking information that could substantially affect our active learning results. To mitigate this, we contacted the authors, who kindly provided us with a checkpoint of the model without VQA pretraining.
However, in addition to this model obtaining different results from those reported in the original work, the provided checkpoint behaves slightly differently during fine-tuning, requiring different hyperparameters from those provided in the original repository. We perform a coarse grid search over hyperparameters, using the LXMERT implementation provided by HuggingFace Transformers (Wolf et al., 2019), and find that using the AdamW optimizer rather than the BERT-Adam optimizer from the original work, without any special learning rate scheduling, yields the best fine-tuning performance.

B.2 Acquisition Functions
We use standard implementations of the 8 active learning strategies described, borrowing from prior implementations (Mussmann and Liang, 2018) and existing code repositories (https://github.com/google/active-learning). We provide additional details below.
Monte-Carlo Dropout. For our implementations of the deep Bayesian active learning methods (Monte-Carlo Dropout w/ Entropy, BALD), we follow Gal and Ghahramani (2016) and estimate a dropout distribution via test-time dropout, running multiple forward passes through our neural networks with different, randomly sampled dropout masks. We use k = 10 forward passes to form the dropout distribution.
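A minimal PyTorch sketch of this procedure: keep the model in eval mode but switch its dropout modules back to train mode, then stack the k stochastic passes (the batching and input format are assumptions).

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mc_dropout_probs(model, inputs, k=10):
    model.eval()
    # Re-enable stochastic dropout while keeping BatchNorm etc. in eval mode.
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()
    passes = [torch.softmax(model(*inputs), dim=-1) for _ in range(k)]
    return torch.stack(passes)   # [k, N, C], fed to the MC-Dropout entropy / BALD scores
```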
Amortized Core-Set Selection. In the original Core-Set selection active learning work, Sener and Savarese (2018) show that Core-Set selection for active learning reduces to a version of the k-centers problem, which can be solved approximately (2-OPT) with a greedy algorithm. However, running this algorithm over high-dimensional representations and large pools can be prohibitive; Core-Set selection is batch-aware, requiring recomputing distances from each "cluster center" (points in the set of acquired examples) to all points in the active learning pool after each acquisition in a batch. While we can run this to completion for smaller datasets (and indeed, this is what we do for VQA-Sports and VQA-Food), a single acquisition iteration on the full VQA-2 dataset takes approximately 20 GPU-hours on the resources we have available, or up to 9 days for a single Core-Set selection run. For GQA, performing exact Core-Set selection takes at least twice as long.
To still capture the spirit of Core-Set diversity-based selection in our evaluation, we instead introduce an amortized implementation of Core-Set selection, which consists of two steps. We first downsample the high-dimensional representations (either the fused vision-and-language representation or one of the unimodal representations) via Principal Component Analysis (PCA) to make the distance computation faster by an order of magnitude. Then, rather than updating distances from examples in our acquired set to points in our pool after each acquisition, we delay updates, refreshing the distance computation only every 2000 acquisitions (roughly 5% of an acquisition batch for VQA-2). This allows us to report results for Core-Set selection with the three proposed representations (Fused, Language-Only, Vision-Only) for VQA-2; unfortunately, for GQA and LXMERT (due to the high cost of training), even running this amortized version of Core-Set selection is prohibitive, so we report a subset of results and omit the rest.
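A sketch of the two amortizations (PCA downsampling plus delayed distance refreshes); the 128-dimensional projection is an illustrative choice, while `refresh_every=2000` mirrors the refresh interval described above. Treat this as a sketch rather than our exact implementation.

```python
import torch

def amortized_coreset(pool_reps, labeled_reps, batch_size, pca_dim=128, refresh_every=2000):
    # (1) Project both representation matrices to a low-dimensional space via (randomized) PCA
    #     so that pairwise distance computation is much cheaper.
    stacked = torch.cat([pool_reps, labeled_reps], dim=0)
    _, _, V = torch.pca_lowrank(stacked, q=pca_dim)            # V: [d, pca_dim]
    pool_p, labeled_p = pool_reps @ V, labeled_reps @ V

    min_dists = torch.cdist(pool_p, labeled_p).min(dim=1).values
    picked, pending = [], []
    for _ in range(batch_size):
        idx = int(min_dists.argmax())
        picked.append(idx)
        pending.append(idx)
        min_dists[idx] = -1.0
        # (2) Delay the expensive refresh: fold newly picked centers into the distance
        #     table only every `refresh_every` acquisitions instead of after each one.
        if len(pending) >= refresh_every:
            new_dists = torch.cdist(pool_p, pool_p[pending]).min(dim=1).values
            min_dists = torch.minimum(min_dists, new_dists)
            min_dists[torch.tensor(picked)] = -1.0
            pending = []
    return picked
```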

C Active Learning Results
We include further results from our study of active learning applied to VQA: results on VQA-Food (not included in the main body), active learning results for the two logistic regression models - LogReg (ResNet-101) and LogReg (Faster R-CNN) - as well as results with the 4 acquisition strategies not included in the main body of the paper - Entropy, Monte-Carlo Dropout w/ Entropy, Core-Set (Language), and Core-Set (Vision).

Figure 9 shows results on VQA-Food with the LSTM-CNN, BUTD, and LXMERT models, with a seed set comprised of 10% of the total pool. The results are mostly similar to those reported in the paper; strategies track or underperform random sampling, with the exception of Least-Confidence for the LSTM-CNN model - however, this is the sole exception, and the LSTM-CNN has the highest training variance of all the models we try.

Figure 10 shows active learning results for the LogReg (ResNet-101) model on VQA-Sports (seed set = 10%) and VQA-2 (seed set = 10%, 50%). Results are similar to those reported in the paper, with active learning failing to outperform random acquisition.

Figure 11 presents the same set of experiments as the prior section, except with the LogReg (Faster R-CNN) model. While the object-based Faster R-CNN representation enables much higher performance than the ResNet-101 representation, active learning results are consistent with those reported in the paper.

Figure 12 presents results for the four other active learning strategies we implement - Entropy, Monte-Carlo Dropout w/ Entropy, Core-Set (Language), and Core-Set (Vision) - for the BUTD model. Results are across VQA-Sports (seed set = 10%) and VQA-2 (seed set = 10%, 50%); despite the unique features of each strategy, the trends remain consistent with those in the paper.

Figure 10: Active learning results using the Logistic Regression (ResNet-101) model on VQA-Sports (10% seed set) and VQA-2 (10% and 50% seed set). Most strategies either track or underperform random acquisition. (x-axis: number of training examples.)

Figure 11: Active learning results using the Logistic Regression (Faster R-CNN) model on VQA-Sports (10% seed set) and VQA-2 (10% and 50% seed set). While the Faster R-CNN representation leads to better validation accuracies, active learning performance remains consistent. (x-axis: number of training examples.)

Figure 12: Results with BUTD on VQA-Sports, VQA-2, and GQA using the alternative 4 acquisition strategies not included in the main body of the paper. Unsurprisingly, results are consistent with those reported in the paper.