Are VQA Systems RAD? Measuring Robustness to Augmented Data with Focused Interventions

Deep learning algorithms have shown promising results in visual question answering (VQA) tasks, but a more careful look reveals that they often do not understand the rich signal they are being fed. To understand and better measure the generalization capabilities of VQA systems, we look at their robustness to counterfactually augmented data. Our proposed augmentations are designed to make a focused intervention on a specific property of the question such that the answer changes. Using these augmentations, we propose a new robustness measure, Robustness to Augmented Data (RAD), which measures the consistency of model predictions between original and augmented examples. Through extensive experimentation, we show that RAD, unlike classical accuracy measures, can quantify when state-of-the-art systems are not robust to counterfactuals. We find substantial failure cases which reveal that current VQA systems are still brittle. Finally, we connect robustness to generalization, demonstrating the predictive power of RAD for performance on unseen augmentations.


Introduction
In the task of Visual Question Answering (VQA), given an image and a natural language question about the image, a system is required to answer the question accurately (Antol et al., 2015). While the accuracy of these systems appears to be constantly improving (Fukui et al., 2016; Lu et al., 2016), they are sensitive to small perturbations in their input and seem overfitted to their training data (Kafle et al., 2019).
To address the problem of overfitting, the VQA-CP dataset was proposed (Agrawal et al., 2018). It is a reshuffling of the original VQA dataset, such that the distribution of answers per question type (e.g., "what color", "how many") differs between the train and test sets. Using VQA-CP, Kafle et al. (2019) demonstrated the poor out-of-distribution generalization of many VQA systems. While many models were subsequently designed to deal with the VQA-CP dataset (Cadene et al., 2019; Clark et al., 2019; Chen et al., 2020; Gat et al., 2020), aiming to solve the out-of-distribution generalization problem in VQA, they were later demonstrated to overfit the unique properties of this dataset (Teney et al., 2020). Moreover, no measures for robustness to distribution shifts have been proposed.
In this work we propose a consistency-based measure that indicates the robustness of VQA models to distribution shifts. Our robustness measure is based on counterfactual data augmentations (CADs), which were shown to be useful for both training (Kaushik et al., 2019) and evaluation (Garg et al., 2019; Agarwal et al., 2020). CADs are aimed at manipulating a specific property while preserving all other information, allowing us to evaluate the robustness of the model to changes to this property.
For example, consider transforming a "what color" question to a "yes/no" question, as depicted in Figure 1. The counterfactual reasoning for such a transformation is: "what would be the question if it had a yes/no answer?". While VQA models have seen many of both question types, their combination (yes/no questions about color) has been scarcely seen. If a model errs on such a combination, this suggests that to answer the original question correctly, the model uses a spurious signal such as the correlation between the appearance of the word "color" in the question and a particular color in the answer (e.g. here, color ⇒ white). Further, this example shows that some models cannot even identify that they are being asked a "yes/no" question, distracted by the word "color" in the augmented question and answering "green".
Our robustness measure is named RAD: Robustness to (counterfactually) Augmented Data (Section 2.1). RAD receives (image, question, answer) triplets, each augmented with a triplet where the question and answer were manipulated. It measures the consistency of model predictions when changing a triplet to its augmentation, i.e., the robustness of the model to (counterfactual) augmentations. We show that using RAD with focused interventions may uncover substantial weaknesses to specific phenomena (Section 3.2). Accordingly, users are encouraged to define their interventions precisely, so that they create counterfactual augmentations. As a result, pairing RAD values with accuracy gives a better description of model behavior.
In general, to effectively choose a model in complex tasks, complementary measures are required (D'Amour et al., 2020). Thus, it is important to have interpretable measures that are widely applicable. Note that in this work we manipulate only textual inputs (questions and answers), but RAD can be applied to any dataset for which augmentations are available. In particular, exploring visual augmentations would be beneficial for the analysis of VQA systems. Further, representation-level counterfactual augmentations are also valid, which is useful when generating meaningful counterfactual text is difficult (Feder et al., 2020).
Our augmentations (CADs) are generated semiautomatically (Section 2.2), allowing us to directly intervene on a property of choice through simple templates. As in the above example, our augmentations are based on compositions of two frequent properties in the data (e.g., "what color" and "yes/no" questions), while their combination is scarce. Intuitively, we would expect a model with good generalization capacities to properly handle such augmentations. While this approach can promise coverage of only a subset of the examples in the VQA and VQA-CP datasets, it allows us to control the sources of the model's prediction errors.
We conduct extensive experiments and report three key findings. First, for three datasets, VQA, VQA-CP, and VisDial (Das et al., 2017), models with seemingly similar accuracy are very different in terms of robustness, when considering RAD with our CADs (Section 3). Second, we show that RAD with alternative augmentation methods, which prioritize coverage over focused intervention, cannot reveal the robustness differences. Finally, we show that measuring robustness using RAD with our CADs predicts the accuracy of VQA models on unseen augmentations, establishing the connection between robustness to our controlled augmentations and generalization (Section 4).

Robustness to Counterfactuals
In this section, we first present RAD (Section 2.1), which measures model consistency on question-answer pairs and their augmented modifications. Then, we describe our template-based CAD generation approach (Section 2.2), designed to provide control over the augmentation process.

Model Robustness
We denote a VQA dataset by $U = \{(x_v, x_q, y) \in V \times Q \times Y\}$, where $x_v$ is an image, $x_q$ is a question and $y$ is an answer. We consider a subset $D \subseteq U$ for which we can generate augmentations. For an example $(x_v, x_q, y) \in D$, we denote an augmented example as $(x_v, x_q', y') \in D'$. In this paper we generate a single augmentation for each example in $D$, resulting in a one-to-one correspondence between $D$ and the dataset of modified examples $D'$. We further define $J(D; f)$ as the set of example indices for which a model $f$ correctly predicts $y$ given $x_v$ and $x_q$.
RAD assesses the proportion of correctly answered modified questions among correctly answered original questions, and is defined as
$$\mathrm{RAD}(D, D'; f) = \frac{|J(D; f) \cap J(D'; f)|}{|J(D; f)|}.$$
Note that RAD is in $[0, 1]$, and the higher the RAD of $f$, the more robust $f$ is.
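As a minimal sketch of this measure, the snippet below computes RAD as the share of correctly answered source examples whose paired counterparts are also answered correctly. The helper names (`correct_indices`, `rad`), the exact-match notion of correctness, and the toy predictions are assumptions made for illustration; they are not taken from the authors' implementation.

```python
from typing import Sequence, Set


def correct_indices(predictions: Sequence[str], gold: Sequence[str]) -> Set[int]:
    """Return J(.; f): indices where the prediction matches the gold answer.

    Exact string match is a simplifying assumption for this sketch.
    """
    return {i for i, (p, g) in enumerate(zip(predictions, gold)) if p == g}


def rad(correct_src: Set[int], correct_tgt: Set[int]) -> float:
    """Fraction of correctly answered source examples whose paired
    target example is also answered correctly."""
    if not correct_src:
        raise ValueError("RAD is undefined when no source example is answered correctly")
    return len(correct_src & correct_tgt) / len(correct_src)


# Hypothetical predictions for five paired (original, augmented) examples.
orig_pred = ["red", "2", "dog", "blue", "3"]
orig_gold = ["red", "2", "cat", "blue", "3"]
aug_pred = ["yes", "no", "yes", "green", "yes"]
aug_gold = ["yes", "yes", "yes", "yes", "yes"]

j_orig = correct_indices(orig_pred, orig_gold)  # originals answered correctly
j_aug = correct_indices(aug_pred, aug_gold)     # augmentations answered correctly

forward = rad(j_orig, j_aug)   # originals -> augmentations
backward = rad(j_aug, j_orig)  # augmentations -> originals
```

Swapping the two arguments gives the measure in the opposite direction, so the same function covers both views of a paired dataset.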
As original examples and their augmentations may differ in terms of their difficulty to the model, it is important to maintain symmetry between $D$ and $D'$. We hence also consider the backward view of RAD, defined as $\mathrm{RAD}(D', D; f)$. For example, "yes/no" VQA questions are easier to answer compared to "what color" questions, as the former have two possible answers while the latter have as many as eight. Indeed, state-of-the-art VQA models are much more accurate on yes/no questions compared to other question types (Yu et al., 2019). Hence, if "what color" questions are augmented with "yes/no" counterfactuals, we would not expect $\mathrm{RAD}(D', D; f) = 1$, as generalizing from "yes/no" questions ($D'$) to "what color" questions ($D$) requires additional reasoning capabilities.
RAD is not dependent on the accuracy of the model on the test set. A model may perform poorly overall but be very consistent on questions that it has answered correctly. Conversely, a model that demonstrates seemingly high performance may be achieving this by exploiting many dataset biases and be very inconsistent on similar questions.
For example, consider the question-answer pair "What color is the vehicle? Red". This question-answer pair can be easily transformed into "Is the color of the vehicle red? Yes". In general, "what color" questions can be represented by the template: "What color is the <Subj>? <Color>". To generate a new question, we first identify the subject (<Subj>) for every "what color" question, and then integrate it into the template "Is the color of the <Subj> <Color>? Yes". As the model was exposed to both "what color" and "yes/no" questions, we expect it to correctly answer the augmented question given that it correctly answers the original. Yet, this augmentation requires some generalization capacity because the VQA dataset contains very few yes/no questions about color.
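The "what color" transformation described above can be sketched with a single regular expression. The pattern, the function name `color_to_yes_no`, and the choice to lowercase the answer are assumptions for illustration only; the actual templates are those listed in Table 1.

```python
import re
from typing import Optional, Tuple

# Matches the template "What color is the <Subj>?" (case-insensitive).
WHAT_COLOR = re.compile(r"^what color is the (?P<subj>.+?)\s*\?$", re.IGNORECASE)


def color_to_yes_no(question: str, answer: str) -> Optional[Tuple[str, str]]:
    """Turn ("What color is the <Subj>?", <Color>) into
    ("Is the color of the <Subj> <Color>?", "yes").

    Returns None when the question does not fit the template, so that
    uncovered examples are skipped rather than transformed incorrectly.
    """
    match = WHAT_COLOR.match(question.strip())
    if match is None:
        return None
    subj = match.group("subj")
    return f"Is the color of the {subj} {answer.lower()}?", "yes"


augmented = color_to_yes_no("What color is the vehicle?", "Red")
skipped = color_to_yes_no("How many dogs are there?", "2")
```

Returning `None` for non-matching questions mirrors the fact that template-based augmentation covers only a subset of the dataset.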
Our templates are presented in Table 1 (see Table 6 in the appendix for some realizations). The augmentations are counterfactual since we intervene on the question type, a prior that many VQA systems exploit (Goyal et al., 2017), keeping everything else equal. The generation process is semi-automatic, as we had to first manually specify templates that would yield augmented questions that we can expect the model to answer correctly given that it succeeds on the original question. To achieve this goal, we apply two criteria: (a) The template should generate a grammatical English question; and (b) The generated question type should be included in the dataset, but not in questions that address the same semantic property as the original question. Indeed, yes/no questions are frequent in the VQA datasets, but few of them address color (first template), number of objects (second template), and object types (third template). When both criteria are fulfilled, it is reasonable to expect the model to generalize from its training set to the new question type.
Criterion (a) led us to focus on yes/no questions since other transformations required manual verification for output grammaticality. While we could have employed augmentation templates from additional question types into yes/no questions, we believe that our three templates are sufficient for evaluating model robustness. Overall, our templates cover 11% of the VQA examples (Section 3.1).

Robustness with RAD and CADs
In the following, we perform experiments to test the robustness of VQA models to augmentations. We describe the experimental setup, and evaluate VQAv2, VQA-CPv2, and VisDial models, each on our augmentations and on other alternatives.
on scene graphs attached to each image, and CS-ConVQA is manually generated by annotators. Finally, back-translation (translating to another language and back) is a high-coverage although low-quality approach to text augmentation. It was used during training and shown to improve NLP models (Sennrich et al., 2016), but has not been considered in VQA. We use English-German translations.
We trained all the models using their official implementations.

Results
Table 2 presents our main results. RAD values for all of our augmentations are substantially lower than those of the alternatives, supporting the value of our focused intervention approach for measuring robustness. The high RAD values for BT and Reph might indicate that VQA models are indeed robust to linguistic variation, as long as the answer does not change. Interestingly, our augmentations also reveal that VQA-CP models are less robust than VQA models. This suggests that despite the attempt to design more robust models, VQA-CP models still overfit their training data.
In VQA-CP, RUBi has the lowest validation accuracy, even though it is more robust to augmentations than LMH and CSS. For VQA models, in contrast, BUTD has the lowest RAD scores on our augmentations and the lowest accuracy. VisualBERT, the only model that utilizes contextual word embeddings, demonstrates the highest robustness among the VQA models.
Finally, while both VisDial models have similar accuracy, they have significantly different RAD scores on our augmentations. Specifically, VisDialBERT performs better than FGA on Y/N C augmentations. This is another indication of the value of our approach as it can help distinguish between two seemingly very similar models.
Complementary to the RAD values in Table 2 we also provide accuracies on original questions in Table 3. Note that across all the original questions, except ConVQA questions, RUBi has the lowest accuracy while CSS has the highest accuracy. This trend is reversed when looking at RAD scores: CSS has the lowest score while RUBi has the highest score. This emphasizes the importance of RAD as a complementary metric, since considering only accuracy in this case would be misleading. Namely, RAD provides additional critical information for model selection.

Measuring Generalization with RAD
To establish the connection between RAD and generalization, we design experiments to demonstrate RAD's added value in predicting model accuracy on unseen modified examples. Concretely, we generate 45 BUTD (VQA) and LMH (VQA-CP) instances, differing by the distribution of question types observed during training (for each model instance we drop between 10% and 99% of each of the question types "what color", "how many" and "what kind" from its training data; see Appendix E for exact implementation details). For each of the above models we calculate RAD values and accuracies in the following manner.
We split the validation set into two parts: $D$ (features) and $T$ (target). Consider a pool of four original question sets, one for each of the modifications Y/N C, Y/N HM, Y/N WK, and Reph. Then we have four possible configurations in which $D$ is three sets from the pool and $T$ is the remaining set. For each model and for each configuration, we compute model accuracy on $D$ (Accuracy($D$)) and on the modifications of the questions in $T$, which are modified with the target augmentation of the experiment (the predicted variable $y(T)$ = Accuracy($T$)). We also compute the RAD values of the model on the modified questions in $D$, which are generated using the other three augmentations ($\mathrm{RAD}(D, D')$ and $\mathrm{RAD}(D', D)$). Then, we train a linear regression model using Accuracy($D$), $\mathrm{RAD}(D, D')$, and $\mathrm{RAD}(D', D)$ to predict $y(T)$. We perform this experiment four times, each using a different configuration (different augmentation type as our target), and average across the configurations.

Results
Table 4 presents the average $R^2$ values and standard deviations over the four experiments. RAD improves the $R^2$ when used alongside the validation accuracy. Interestingly, a model's accuracy on one set of augmentations does not always generalize to other, unseen augmentations. Only when adding RAD to the regression model are we able to identify a robust model. Notably, for LMH the usefulness of RAD is significant, as it improves the $R^2$ by 11%. RAD alone also predicts performance better than validation accuracy alone. The standard deviations further confirm that the above claims hold over all configurations. These observations also hold when running the same experiment with the BUTD model; however, the improvements are smaller since the regression task is much easier for this model ($R^2$ of 0.995 with all features).

Conclusion
We proposed RAD, a new measure that penalizes models for inconsistent predictions over data augmentations. We used it to show that state-of-the-art VQA models fail on CADs that we would expect them to properly address. Moreover, we have demonstrated the value of our CADs by showing that alternative augmentation methods cannot identify robustness differences as effectively. Finally, we have shown that RAD is predictive of generalization to unseen augmentation types. We believe that the RAD measure brings substantial value to model evaluation and consequently to model selection. It encourages the good practice of testing on augmented data, which was shown to uncover considerable model weaknesses in NLP (Ribeiro et al., 2020). Further, given visual augmentations, which we plan to explore in future work, or linguistic augmentations, RAD is applicable to any classification task, providing researchers with meaningful indications of robustness.

A Dataset Statistics
Please see Table 5 for the number of examples in each dataset that we use (VQA, VQA-CP and VisDial). We also report the number of augmentations we produce for each of our three augmentation types (Y/N C, Y/N HM and Y/N WK), alongside previous augmentation approaches used in our experiments (BT, Reph, L-ConVQA and CS-ConVQA).

B Our Augmentations
We describe the manual steps required to meet the desired standard for each augmentation type. For Y/N C, we select questions that start with "What color is the". For Y/N HM, we use questions that start with "How many". For Y/N WK, we consider questions that match the pattern "What kind of <S> is this? <O1>". Table 6 presents several realizations of the templates we define (see Section 2.2 for a discussion of these templates).
In Y/N HM, we ensure that when the answer is '1', we use "Is there ..." instead of "Are there ...". We also ensure that the subsequent word to "How many" is a noun. We verify it is a noun using the part-of-speech tagger available through the spaCy library (Honnibal et al., 2020).
We allow the generation of both 'yes' and 'no' answers. Creating a modified question that is answered with a 'yes' requires a simple permutation of words in the original question-answer pair, e.g., for Y/N C, set "<C2>" = "<C1>" (see Table 1). Similarly, to generate a question that should be answered with a 'no', we repeat the above process and change "<C2>". In this case, we replace the original answer with one drawn at random from the pool of possible answers for the given augmentation type, with probability proportional to its frequency in the data. When generating a new question, we first randomly decide whether to generate a 'yes' or 'no' question (with a probability of 0.5 for each). Then, for example, if we choose to generate a 'no', and "<C1>" = "red", we have a 63% chance of having "<C2>" = "blue".
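The sampling procedure described here can be sketched as follows. The function name `sample_yes_no_color`, the explicit exclusion of the original answer from the distractor pool, and the toy frequency table are assumptions for illustration; the frequencies are not the dataset's actual statistics.

```python
import random
from typing import Dict, Tuple


def sample_yes_no_color(original: str, freq: Dict[str, int],
                        rng: random.Random) -> Tuple[str, str]:
    """Return (<C2>, answer) for a Y/N C question whose original color is <C1>.

    With probability 0.5 the question keeps the original color and is
    answered 'yes'; otherwise a different color is drawn with probability
    proportional to its frequency, and the answer is 'no'.
    """
    if rng.random() < 0.5:
        return original, "yes"
    # Assumption: the original answer is excluded so a 'no' is always valid.
    candidates = [c for c in freq if c != original]
    if not candidates:
        raise ValueError("no alternative answer available for a 'no' question")
    weights = [freq[c] for c in candidates]
    distractor = rng.choices(candidates, weights=weights, k=1)[0]
    return distractor, "no"


# Hypothetical answer frequencies for the "what color" pool.
color_freq = {"white": 50, "blue": 30, "red": 15, "green": 5}
rng = random.Random(0)
samples = [sample_yes_no_color("red", color_freq, rng) for _ in range(200)]
```

Passing an explicit `random.Random` instance keeps the generated augmentations reproducible across runs.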

D Model Settings
We have trained the VQAv2 and the VQA-CPv2 models that we use, as pre-trained weights were not available for our requirements. For our evaluations, we require a model that is trained solely on the VQAv2 train set, such that we match the VQA-CPv2 settings, where there are only two sets, train and validation. In contrast, pre-trained models that are built for VQAv2 are trained on the VQAv2 training set and on the VQAv2 validation set together, as the dataset contains a third development set that is commonly used for validation. We have trained six VQA models using the default hyper-parameters from their official implementations (URLs in Appendix C): RUBi, LMH, CSS, BUTD, BAN and Pythia. We trained the above models on a single Nvidia GeForce RTX 2080 Ti GPU, and the training time for each of the models was less than 12 hours. In addition, inference in this setting took less than an hour for all models.
The VisualBERT model is more computationally intensive, and we had to reduce the default batch size from 480 to 54 to fit it on our resources. Using three Nvidia GeForce RTX 2080 Ti GPUs for VisualBERT, training took 36 hours and inference took 4 hours.
For the VisDial models, FGA, and VisDialBERT, we have downloaded the pre-trained weights and used them solely for inference. On a single Nvidia GeForce RTX 2080 Ti GPU, inference took 15 minutes for FGA, and 8 hours for VisDialBERT.
All the models we consider have less than 200M parameters.
When accuracies are reported on VQAv2 and on VQA-CP (Tables 2 and 3) we use the VQA-accuracy metric (Antol et al., 2015). For VisDial we use the standard accuracy metric (denoted originally as R@1).
We split the validation set into two parts: $D$ and $T$. $D$ is used to calculate the features in our linear regression model. We denote by $D_1$ the questions in $D$ that can be modified using the Y/N C augmentation, after these questions were modified. Similarly, we define $D_2$, $D_3$, and $D_4$ for Y/N HM, Y/N WK, and Reph, respectively.
We average the $R^2$ of four linear regression experiments, where in each experiment we set a different $i$ ($i \in \{1, 2, 3, 4\}$) for which $T = D_i$ and use the remaining three templates to calculate our features. We denote the regression features by $x_1$ = Accuracy($D$), $x_2 = \mathrm{RAD}(D, D')$, and $x_3 = \mathrm{RAD}(D', D)$, where $\mathrm{RAD}(D, D')$ and $\mathrm{RAD}(D', D)$ are computed with respect to the three other templates ($j \in \{1, 2, 3, 4\}$, $j \neq i$). The predicted label is $y(T)$ = Accuracy($T$).
Thus the equation for our regression is:
$$y(T) = b_1 x_1 + b_2 x_2 + b_3 x_3 + \epsilon.$$
We also perform three regression experiments, one for each feature alone:
$$y(T) = b x_k + \epsilon, \quad k = 1, 2, 3,$$
and compare the results of these experiments in Table 4.
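To illustrate how the $R^2$ of the full regression can be compared with that of a single-feature regression, the following sketch fits ordinary least squares in plain Python. Everything here is an assumption for illustration: the helper names, the inclusion of an intercept term, and the toy feature values, which are not results from the paper.

```python
from typing import List, Sequence


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def r_squared(features: Sequence[Sequence[float]], y: Sequence[float]) -> float:
    """Fit y on the features (plus an intercept, an assumption of this sketch)
    via the normal equations and return the coefficient of determination."""
    rows = [[1.0, *f] for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    coef = solve(xtx, xty)
    pred = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical model instances: (Accuracy(D), RAD(D, D'), RAD(D', D)).
feats = [(0.60, 0.40, 0.55), (0.62, 0.52, 0.60), (0.65, 0.45, 0.70),
         (0.58, 0.35, 0.50), (0.66, 0.60, 0.62), (0.61, 0.48, 0.66)]
target = [0.50, 0.58, 0.57, 0.44, 0.65, 0.56]  # hypothetical Accuracy(T)

r2_all = r_squared(feats, target)                         # x1, x2, x3 together
r2_acc_only = r_squared([(f[0],) for f in feats], target)  # x1 alone
```

In practice a library such as NumPy or scikit-learn would replace the hand-written solver; the plain-Python version is used here only to keep the sketch dependency-free.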