ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos

Multimodal counterfactual reasoning is a vital yet challenging ability for AI systems. It involves predicting the outcomes of hypothetical circumstances based on vision and language inputs, which enables AI models to learn from failures and explore hypothetical scenarios. Despite its importance, only a few datasets target the counterfactual reasoning abilities of multimodal models, and they cover only synthetic environments or specific types of events (e.g., traffic collisions), making it hard to reliably benchmark model generalization across diverse real-world scenarios and reasoning dimensions. To overcome these limitations, we develop a video question answering dataset, ACQUIRED: it consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first- and third-person viewpoints, which ensures a focus on real-world diversity. In addition, each video is annotated with questions that span three distinct reasoning dimensions, namely physical, social, and temporal, which allows a comprehensive evaluation of models' counterfactual abilities along multiple aspects. We benchmark several state-of-the-art language-only and multimodal models on our dataset, and the experimental results demonstrate a significant performance gap (>13%) between models and humans. The findings suggest that multimodal counterfactual reasoning remains an open challenge and that ACQUIRED is a comprehensive and reliable benchmark for inspiring future research in this direction.


Introduction
* The authors contributed equally.

[Figure 1 examples: third-person question "What if the man tested the weight of the dolly before pushing it?" paired with "If he tested the weight of the dolly before pushing it, it wouldn't have dragged him down the ladder." and "If he tested the weight of the dolly before pushing it, he would have pushed it harder down the ladder."; first-person question "What if I wasn't wearing gloves?" paired with "I could get my hands stained with the red sauce." and "I could accidentally cut myself with the cutter."]

Figure 1: The ACQUIRED dataset is a video question answering (QA) dataset that specifically focuses on counterfactual reasoning over diverse real-world events. Our dataset concerns three commonsense reasoning dimensions: physical, social, and temporal, and encompasses videos from both third-person (upper) and first-person (lower) viewpoints. Each question is curated with a correct and a distractor answer. Each answer is individually judgeable, and hence our dataset can be approached in either a binary True/False or a multiple-choice setting.
Multimodal counterfactual reasoning refers to the ability to imagine and reason about what might have happened if certain conditions were different from what actually occurred, based on vision and language inputs. It involves mentally simulating alternative scenarios and evaluating their potential outcomes. This cognitive process plays a crucial role in human intelligence, as it allows us to understand causality, make predictions, and learn from past experiences. For AI models, developing the capacity for counterfactual reasoning is a significant area of research and a challenging task. By enabling AI models to engage in counterfactual reasoning, we can enhance their understanding of causal relationships and their ability to assess the impact of interventions or changes in conditions.
However, despite the significance of counterfactual reasoning, it remains a relatively unexplored area of research. To assess the overall reasoning capabilities of models, several visual question answering datasets have been proposed on both images (Antol et al., 2015; Johnson et al., 2017) and videos (Yi et al., 2020; Xu et al., 2021). These datasets require reasoning skills such as commonsense reasoning, extracting human/object-to-object relations, and inferring physical properties.
One specific dataset in the realm of counterfactual reasoning is CLEVRER (Yi et al., 2020), which generates synthetic videos and associated questions in a controlled environment, featuring simulated object motion and rendered video frames. This dataset evaluates models with descriptive, explanatory, predictive, and counterfactual questions, covering a wide range of reasoning scenarios. However, the data generation process in CLEVRER is overly synthetic, limiting its effectiveness in assessing models' counterfactual reasoning abilities in realistic contexts. To address this limitation, TrafficQA (Xu et al., 2021) focuses on real-world traffic event cognition and reasoning in videos, specifically targeting scenarios like traffic accidents. It leverages crowdsourcing to gather diverse types of questions, including fundamental comprehension, counterfactual inference, and event forecasting. Nevertheless, because TrafficQA concentrates solely on traffic events, it fails to encompass other real-life events, resulting in a substantial domain gap between TrafficQA and general video datasets such as Kinetics (Kay et al., 2017; Smaira et al., 2020) and YouTube (Abu-El-Haija et al., 2016; Zellers et al., 2022).
In this paper, we construct a benchmark that can evaluate the counterfactual reasoning abilities of visual models on various kinds of real-world events. We introduce ACQUIRED (Answering Counterfactual Questions In Real-Life Videos), which covers multiple dimensions of counterfactual reasoning and includes videos of both egocentric and exocentric views. Specifically, based on videos from both Oops (Epstein et al., 2020) and Ego4D (Grauman et al., 2022), we crowd-source 11K questions over 3.7K videos targeting physical, temporal, and social counterfactual reasoning. Both the Oops and Ego4D datasets consist of human activities and interactions in numerous settings, making them ideal sources for curating video question answering datasets. In addition, many videos contain unintentional human actions (e.g., the person accidentally falling down the ladder in Figure 1), which naturally enables people to come up with diverse what-if questions.
Inspired by Singh et al. (2021), we adopt a similar methodology for gathering counterfactual questions. Each question consists of a pair of answers, with one being the correct response and the other serving as a distractor. Importantly, the distractor answer represents a minimal contrastive counterpart to the correct answer. As the examples in Figure 1 show, this design of complementary pairs requires the model to understand the subtle differences between options, which ensures that the model exhibits an intuitive grasp of counterfactual reasoning. In addition, having one distractor for each question allows for testing models in either a True/False or a multiple-choice setting.
We extensively evaluate numerous strong language models such as GPT-4, as well as state-of-the-art video-language models such as VALOR, on our ACQUIRED dataset. The experimental results suggest that models struggle to effectively utilize the video contexts and perform counterfactual reasoning, with multimodal models achieving performance only comparable to, and sometimes worse than, language-only models. Moreover, the significant gap (>13%) between human and model performance highlights the challenging nature of our task and the room for improvement in visual counterfactual reasoning.

Related Work
We overview three lines of research relevant to this work: visual question answering, visual understanding models, and counterfactual reasoning.

Visual Question Answering Datasets. In Table 1, we list several representative visual QA datasets along with their key features. The Visual Question Answering (VQA) dataset (Antol et al., 2015) is one of the pioneering works in this direction and has been a standard benchmark for evaluating the reasoning ability of image-language models (Goyal et al., 2017). Follow-up datasets such as CLEVR (Johnson et al., 2017) and GQA (Hudson and Manning, 2019) automatically construct compositional questions over real or synthetic images and perform the evaluation in a systematic way. To further evaluate the commonsense reasoning ability of models, VCR (Zellers et al., 2019) crowd-sources commonsense question-answer pairs, associated with rationales, over static images extracted from movies. Video question answering is more challenging than image question answering and is gaining increasing attention from the research community, leading to several video QA datasets being constructed (Lei et al., 2020; Tapaswi et al., 2016; Xu et al., 2017; Jang et al., 2017; Mun et al., 2017; Lei et al., 2018). Among them, CLEVRER (Yi et al., 2020) improves upon CLEVR and uses programmatically generated videos capturing collisions of synthetic objects to evaluate model reasoning abilities along multiple dimensions. Social-IQ (Zadeh et al., 2019) and TrafficQA (Xu et al., 2021) employ videos depicting real-world events, wherein Social-IQ primarily emphasizes human social interactions, while TrafficQA focuses on traffic events and accidents. To improve the diversity of the captured events, NExT-QA (Xiao et al., 2021) and Causal-VidQA (Li et al., 2022) collect videos from diverse domains and have human-annotated questions targeting different dimensions of reasoning.
As can be seen in Table 1, only a few visual QA datasets attempt to evaluate the counterfactual reasoning abilities of models. In addition, the existing benchmarks are often limited in terms of video sources and question types, making it difficult to evaluate model performance in diverse real-world settings. ACQUIRED is the first dataset that comprehensively evaluates counterfactual reasoning abilities across three distinct dimensions (i.e., physical, social, and temporal) while covering videos of a wide range of event types and from different viewpoints.
Visual Understanding Models. The creation of visual QA benchmarks allows for the development of visual understanding models. Many previous works have tried to solve these tasks using compositional approaches and scene graphs (Santoro et al., 2017; Hu et al., 2017; Hudson and Manning, 2018; Perez et al., 2018; Yi et al., 2018; Shi et al., 2019; Gao et al., 2020; Ding et al., 2021). For example, Hu et al. (2017) propose to train a modular network in an end-to-end manner to achieve both effectiveness and interpretability; Hudson and Manning (2018) utilize scene graphs and perform differentiable neural operations on the graphs for visual reasoning. Inspired by the success of pretraining on Internet-scale data (Devlin et al., 2019), pretraining models on large vision and vision-language tasks and then finetuning them on specific downstream tasks has become a standard approach to visual understanding (Sun et al., 2019; Li et al., 2020; Zhu and Yang, 2020; Lei et al., 2021; Zellers et al., 2021; Fu et al., 2021; Zellers et al., 2022; Wu et al., 2022). Existing works in this direction generally train models on large vision-language datasets with objectives such as masked language modeling and video-text matching. Despite the great progress in this direction, it is unclear whether these models can perform counterfactual reasoning. To address this, we benchmark state-of-the-art models on ACQUIRED and systematically study their performance.
Causal and Counterfactual Reasoning. Humans can infer how an event would have unfolded differently without experiencing this alternative reality, an ability that has been a long-standing research topic in cognitive psychology (Van Hoeck et al., 2015). To endow artificial intelligence with such an important ability, researchers have tried to build learning models that can infer causal relations and perform reasoning in various fields (Qin et al., 2019; Yi et al., 2020; Baradel et al., 2020; Abbasnejad et al., 2020; Yue et al., 2021; Wang et al., 2021). Our benchmark provides a valuable resource for developing and evaluating visual models with counterfactual reasoning abilities.
The ACQUIRED Dataset

Dataset Design & Collection
Problem Definition. As illustrated in Figure 1 and Table 2, each data point in ACQUIRED consists of a video and corresponding annotated question and answer pairs. Inspired by prior works (Clark et al., 2019; Singh et al., 2021), we note the surprisingly difficult nature of the T/F (yes/no) QA format, which can also exhibit fewer unintended biases/artifacts than data curated in a multiple-choice (MCQ) setting. In light of this, for each question, we collect one correct and one distractor answer (which can be a slightly perturbed version of the correct one), each of which is individually judgeable on its own. Hence, our dataset can be approached either as a binary True/False (T/F) prediction task or as a multiple-choice (MCQ) question answering task (with two choices).
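To make the two evaluation settings concrete, the following is a minimal sketch of how a single annotated item yields two True/False instances and one two-choice MCQ instance. The example question and all field names are illustrative placeholders, not the exact release schema.

```python
# Minimal sketch of the two evaluation settings derived from one annotated item.
# The question text and field names below are illustrative placeholders.
item = {
    "question": "What if the person had checked the ladder before climbing it?",
    "correct": "<answer consistent with the video under the hypothetical condition>",
    "distractor": "<minimally perturbed answer that contradicts the video>",
}

# Binary True/False (T/F) setting: each answer is judged on its own.
tf_instances = [
    {"question": item["question"], "answer": item["correct"], "label": True},
    {"question": item["question"], "answer": item["distractor"], "label": False},
]

# Multiple-choice (MCQ) setting: the two answers are compared head-to-head.
mcq_instance = {
    "question": item["question"],
    "choices": [item["correct"], item["distractor"]],
    "answer_index": 0,
}
```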
It is worth noting that the distractors in our dataset are manually curated with subtle twists on the correct answers (examples in Table 2), forcing the models to truly understand the visual concepts involved in the counterfactual questions in order to answer correctly.
In Section 5.2, we describe our adoption of a pairwise consistency metric that requires the model to answer correctly in both the correct and distractor directions to be regarded as a success, which discourages models from exploiting surface-level heuristics to predict the answers.

Commonsense Dimensions.
We adopt the commonsense knowledge categorization proposed in Singh et al. (2021), which is inspired by the Theory of Core Knowledge (Spelke and Kinzler, 2007), i.e., the capability of reasoning about physical objects, places, motions, and the social world, to collect QAs that focus on the following three dimensions: physical, social, and temporal. The physical dimension concerns knowledge of the objects involved in the events and their properties (e.g., shape, size, functionalities, affordances), as well as the motion and location of the events. The social dimension looks at human social behaviors, particularly attributes such as personality, emotions, inner interests/intentions, and social activities. The temporal dimension regards aspects of events/activities such as their temporal ordering, duration, and frequency/speed of motion.
These three main dimensions are the building blocks of comprehensive commonsense reasoning, and they help systematically analyze in which aspects the models most need to improve. Although some questions can be answered using more than one commonsense dimension, we ask the annotators to label each question with the main one used.
Video Resources & Sampling. We utilize the Oops! (Epstein et al., 2020) dataset for third-person view videos and Ego4D (Grauman et al., 2022) for first-person views, both of which feature text descriptions of the video contents. Oops! concerns predicting the failing (oops) moment of an intended action in a video, and hence is event-rich and a good testbed for reasoning about how outcomes could have turned out differently. Ego4D collects videos of humans performing daily activities from the first-person view, which adds a desirable task-knowledge layer on top of its event richness.
As we annotate subsets of videos from the aforementioned sources, we have the opportunity to encourage a more balanced distribution of key events among the videos to be annotated. Specifically, we (1) use NLP tools such as semantic role labeling (SRL) to extract key verbs (events) from each video description and group the videos accordingly, (2) each time sample an event group with a probability inversely proportional to the key-event distribution of the videos launched so far, (3) sample a video from the event group selected in (2), and (4) repeat until reaching the desired number of videos to be annotated.
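The following is a minimal sketch of this balanced sampling loop, assuming key events have already been extracted per video (e.g., via SRL); the function and variable names are illustrative, not our exact implementation.

```python
import random
from collections import Counter

def sample_videos_for_annotation(event_to_videos, num_videos):
    """Sketch of the inverse-frequency event sampling described above.

    event_to_videos: dict mapping a key event (verb) to the list of candidate
    videos whose descriptions contain that event. Note: the dict is mutated
    as event groups become exhausted.
    """
    launched_counts = Counter()  # key-event distribution of videos sampled so far
    selected = []
    while len(selected) < num_videos and event_to_videos:
        events = list(event_to_videos.keys())
        # Probability inversely proportional to how often the event already
        # appears among the launched (selected) videos.
        weights = [1.0 / (1 + launched_counts[e]) for e in events]
        event = random.choices(events, weights=weights, k=1)[0]
        candidates = [v for v in event_to_videos[event] if v not in selected]
        if not candidates:
            del event_to_videos[event]  # event group exhausted
            continue
        selected.append(random.choice(candidates))
        launched_counts[event] += 1
    return selected
```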
This sampling strategy, combined with our predefined reasoning dimensions and video domains, is designed to improve the diversity of the question-answer pairs.

Collection Workflow. We collect our dataset via Amazon Mechanical Turk (MTurk). Each MTurk worker is asked to carefully watch a given video before creating the QA pairs. As depicted in Figure 2, our dataset collection process comprises four main steps: (1) We design a qualification questionnaire focusing on examining one's understanding of the key concepts in our problem design, i.e., the concept of counterfactuality, the requirement of video relevancy, the commonsense reasoning dimensions, and what types of QA pairs are more desirable. (2) Once workers pass the qualification test, they are directed to an interface where a pretrained (text-only) QA model is deployed in the loop of the QA creation process; bonus monetary rewards are given if the deployed model fails to predict their creations correctly. (3) Internal members then conduct a quality validation on the created samples and provide customized tips and/or feedback to the workers for potential improvements. (4) Lastly, our deployed model is iteratively finetuned on the validated samples after each batch of annotations, which results in a constantly improving model that incentivizes more challenging sample creations.

Integrating the model-in-the-loop protocol into the pipeline not only helps curate more challenging samples but also diversifies the answers, since the model cannot be easily fooled if similar patterns exist in the dataset or if the questions can be guessed without visual inputs.

Quality Validation. To further ensure sample quality and to summarize common mistakes into custom feedback for the annotators, our internal members conduct a second-phase manual sample validation in conjunction with the deployed model results. We cross-validate the annotations among our internal members in the ramping-up phase to ensure quality. We also accumulate detailed guidelines from our manual validation process for providing effective feedback. After scaling up, we continue to validate the annotations via uniform subsampling across each annotator. Our validation criteria are well aligned, as reflected in the high Kappa score of 0.85 for commonsense-dimension agreement and the 0.91 overlap ratio for video relevancy.

Validation Analysis. Table 3 reports the data drop rates (voted to drop by the majority of the three validators) for the first 5 batches. We hope these rigorous checks ensure good data quality that closely follows our guidelines; the validation should by no means introduce unnecessary biases, and we indeed saw a decrease in the drop rates in our later collection batches.
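Returning to step (2) of the workflow above, the following is a minimal sketch of the model-in-the-loop fooling check; the deployed model interface and the bonus logic here are assumptions for illustration, not our exact production setup.

```python
def review_submission(qa_model, question, correct, distractor):
    """Sketch of the model-in-the-loop check applied to a newly written QA pair.

    qa_model.predict_true_false(question, answer) is an assumed interface that
    returns the deployed text-only model's True/False judgment for an answer.
    """
    model_fooled = (
        qa_model.predict_true_false(question, correct) is not True
        or qa_model.predict_true_false(question, distractor) is not False
    )
    # Annotators receive a bonus when the deployed model is fooled, which
    # incentivizes harder, less guessable counterfactual QA pairs.
    return {"accept_for_validation": True, "bonus_awarded": model_fooled}
```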

Dataset Statistics
General Statistics. Table 4 reports the essential statistics of the collected dataset, where Table 4a is for videos obtained from the Oops! (Epstein et al., 2020) dataset, whereas Table 4b is for videos from Ego4D (Grauman et al., 2022).

Deployed Model. Table 5 reports the model fooling rates in our collected data across the two data sources. We encourage our annotators to develop QA pairs that can successfully fool our model by setting up monetary rewards and unlimited trials.

Benchmarking Models
We benchmark our dataset with both state-of-the-art language-only and vision-language models. Specifically, we perform experiments with DeBERTa (He et al., 2021), UnifiedQA (Khashabi et al., 2020), VIOLET (Fu et al., 2021), VALOR (Chen et al., 2023), and VL-Adapter (Sung et al., 2022) on our dataset.

Language-Only Models. While ACQUIRED is a multimodal dataset with both vision and language inputs, previous works (Thomason et al., 2019) have pointed out that unimodal models can sometimes achieve surprisingly strong performance because of annotation bias. Therefore, we evaluate both DeBERTa-v3 (He et al., 2021) and the UnifiedQA model family (Khashabi et al., 2020), state-of-the-art question answering models based on the T5 architecture (Raffel et al., 2020), on our dataset; these can reflect dataset biases and provide an important reference point for multimodal models. The language-only models answer the textual questions without looking at the videos.
Inspired by the superior performance of recent large language models, i.e., the GPT models from OpenAI, we also evaluate their zero-shot performance on the textual parts of our dataset. Specifically, we consider both ChatGPT (OpenAI, 2023a) and GPT-4 (OpenAI, 2023b). In addition, we further include a version of the GPT models that conditions on pre-annotated descriptions of the general contents of the videos, which serve as pseudo visual (and situated) contexts for the questions. Details on how we prompt the GPT family are in Appendix C.1.
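Since our exact prompts are given in Appendix C.1, the following is only an illustrative sketch of how such zero-shot prompts can be assembled for the T/F and MCQ settings, optionally prepended with a pre-annotated video description; the template wording and function name are assumptions, not our actual prompts.

```python
def build_prompt(question, setting, answer=None, choices=None, video_description=None):
    """Assemble a zero-shot prompt for the language-only GPT baselines (illustrative)."""
    context = f"Video description: {video_description}\n" if video_description else ""
    if setting == "tf":
        # True/False: judge a single answer in isolation.
        return (f"{context}Question: {question}\nAnswer: {answer}\n"
                "Is this answer correct? Reply with True or False.")
    elif setting == "mcq":
        # Multiple-choice: pick the better of the two candidate answers.
        options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
        return (f"{context}Question: {question}\n{options}\n"
                "Which option is correct? Reply with A or B.")
    raise ValueError(f"Unknown setting: {setting}")
```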
VIOLET (Fu et al., 2021). VIOLET is a video-language model with three components: a video encoder (Swin Transformer-base (Liu et al., 2022)), a language encoder (BERT-base (Devlin et al., 2019)), and a cross-modal transformer module that performs cross-modal fusion. The video and language encoders extract features from the video and language inputs respectively, and the extracted features are then fed into the cross-modal transformer for cross-modal interactions. VIOLET is pretrained on large-scale video-text data with masked language modeling, which predicts the original word tokens given the masked inputs; masked visual-token modeling (MVM), which recovers the masked video patches conditioned on the unmasked video and language inputs; and visual-text matching, which aims to align paired video-text inputs across the video and text modalities.
VALOR (Chen et al., 2023). VALOR is a recently proposed multimodal model that can take video, language, as well as audio as inputs. Similar to VIOLET, VALOR first encodes vision, audio, and text inputs separately, and the encoded features are then fed into a multimodal decoder for text generation. VALOR demonstrates strong performance across a wide range of tasks, including video retrieval, video captioning, video question answering, audio-visual captioning, text-to-audio retrieval, and audio captioning.
VL-Adapter (Sung et al., 2022). VL-Adapter uses a pretrained vision encoder (e.g., CLIP (Radford et al., 2021)) to extract vision features and feeds these features, along with text tokens, to a pretrained language model (e.g., T5 (Raffel et al., 2020)) so that the model can take in both vision and language information. Because it can be costly to finetune all the model parameters when adapting the model to downstream tasks, VL-Adapter investigates different adapter-based parameter-efficient finetuning strategies and demonstrates that training only the adapters, which updates a rather small portion (e.g., 4%) of the total parameters, matches the performance of finetuning the entire model. Because VL-Adapter supports different combinations of pretrained vision and language encoders, we employ different versions of CLIP-ViT-B/16 and UnifiedQA-Large as the vision and text encoders.
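As a rough illustration of the parameter-efficient idea behind adapter-based methods such as VL-Adapter (a generic sketch, not their exact implementation), a standard bottleneck adapter inserted after a frozen transformer block looks roughly like the following; the hidden size and reduction factor are placeholder values.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Standard bottleneck adapter: down-project, nonlinearity, up-project, residual."""

    def __init__(self, hidden_dim=1024, reduction=8):
        super().__init__()
        bottleneck = hidden_dim // reduction
        self.down = nn.Linear(hidden_dim, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_dim)
        self.act = nn.ReLU()

    def forward(self, hidden_states):
        # Only the adapter parameters (a few percent of the total) are updated
        # during finetuning; the surrounding pretrained model stays frozen.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Usage sketch: apply an adapter to a frozen transformer block's output.
adapter = BottleneckAdapter(hidden_dim=1024, reduction=8)
frozen_block_output = torch.randn(2, 16, 1024)  # (batch, tokens, hidden)
adapted = adapter(frozen_block_output)
```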

Training and Implementation Details
We obtain the pretrained weights of all the benchmarked models from their respective open-source releases and finetune them on our official training data split. The hyperparameters are manually tuned for each model, and the checkpoints used for testing are selected by validation performance.

Experimental Setup
Data Splits. For our official (to-be-released) dataset, we follow a 45-5-50 ratio and randomly split the data into train, development, and test sets. The train split mainly serves to adapt models to our QA task settings as well as the counterfactual reasoning style. We ensure that there are no video overlaps between the different sets and that Oops! and Ego4D videos are equally distributed in each of the splits.
Evaluation Metrics. Models are evaluated with a simple accuracy metric in both the T/F and MCQ settings. We further break down model performance along the commonsense dimensions and/or viewpoints for a more detailed analysis. We also include the pairwise accuracy in the T/F setting following Singh et al. (2021), where the model is considered correct only if both individual judgments in a pair are correct.
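A minimal sketch of the pairwise accuracy computation in the T/F setting (variable names are illustrative): a question pair counts as correct only when the correct answer is judged True and the distractor is judged False.

```python
def pairwise_accuracy(pairs):
    """pairs: list of (pred_for_correct, pred_for_distractor) booleans, one per question.

    A pair counts as correct only when the correct answer is predicted True
    AND the distractor is predicted False.
    """
    if not pairs:
        return 0.0
    hits = sum(1 for pred_correct, pred_distractor in pairs
               if pred_correct is True and pred_distractor is False)
    return hits / len(pairs)

# Example: two questions; only the first is answered consistently in both directions.
print(pairwise_accuracy([(True, False), (True, True)]))  # 0.5
```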
Training Details. All the models in this work are trained on multiple (2-4) Nvidia A100 GPUs on an Ubuntu 20.04.2 operating system.
We train our models until performance convergence, as determined by the development set performance. All hyperparameters are manually tuned and searched, with multiple trials to obtain better performance and stable training convergence.

Experimental Results
Table 6 reports the benchmark performance. The best-performing multimodal model (VL-Adapter) performs slightly better than its text-only counterpart, UnifiedQA-Large (i.e., the language encoder of our VL-Adapter). While this shows that visual contexts and multimodality are helpful, the performance gap is not substantial; there is thus room for improvement, and more effective ways of exploiting multimodal inputs are yet to be explored. While the text-only UnifiedQA-3B achieves overall better performance in both the T/F and MCQ settings, potentially due to its much larger number of learnable parameters, its mediocre pairwise accuracy suggests that the model still falls short of robust counterfactual reasoning across the two facets of the same question.
In general, models perform better in the MCQ setting than in the T/F one. This is intuitive because in the MCQ setting the model knows that exactly one of the two given options is correct and only needs to compare them and select the more reasonable option (a phenomenon also studied in Clark et al. (2019) and Singh et al. (2021)). ChatGPT is an exception: unlike the other models, its MCQ accuracy is lower than its T/F accuracy. We suspect that ChatGPT has weaker reasoning ability than GPT-4. We also observe that ChatGPT often refuses to give an answer in the MCQ setting, citing insufficient information, while it leans towards predicting False when asked the same question in the T/F setting.
Perhaps surprisingly, despite the remarkable capabilities of the GPT series, these models do not perform as impressively, even when provided with descriptions transcribing the major visual events in the videos. This suggests that the annotators in our curation task indeed closely examined many visual details in order to create more challenging samples.
Human Performance. We randomly sub-sample 500 videos to estimate human performance, reported in the last two rows of Table 6. Human performance shows a significant gap above all the model results, especially in the MCQ setting. We hope future modeling endeavors can close this gap in visual counterfactual reasoning.
Commonsense Dimensions. The rightmost parts of Table 6 report the performance breakdown along the commonsense reasoning dimensions. We observe a general trend: most models perform better in the physical and social dimensions than in the temporal dimension, with the physical dimension generally exhibiting the highest performance. This observation implies that, even after being finetuned on our dataset, the models still fall short of capturing temporal commonsense as opposed to the other two kinds of knowledge. This can also hypothetically be attributed to the pretraining data of the language models encapsulating more physical and/or human social knowledge.

Viewpoints. We take the best-performing multimodal model (VL-Adapter) and ablate its performance along the different video viewpoints. We find that, despite being pretrained mostly on third-person viewpoint videos, the model generalizes reasonably well to first-person viewpoints. However, as the videos from Ego4D are not intended to explicitly contain failed actions from the camera wearers, it could be more challenging for our annotators to construct diverse and subtle counterfactual questions compared to the videos from Oops!. Nevertheless, we argue that the counterfactual reasoning ability of models should be equally crucial regardless of video viewpoint, and our dataset can inspire relevant research as a first-of-its-kind counterfactual video QA dataset encompassing videos from varying viewpoints.

Conclusions
In this work, we present a novel counterfactual-reasoning-focused video question answering dataset, named ACQUIRED. The dataset provides questions about counterfactual hypotheses over visual events (videos). We collect a correct and a distractor answer for each question across three commonsense reasoning dimensions: physical, social, and temporal. We benchmark various state-of-the-art language models (including LLMs such as GPT) and video-language models on the collected dataset, and the results demonstrate model performance well below human performance (a gap of >13% accuracy). We hope our studies and the collected ACQUIRED dataset can spur relevant future research, specifically on testing multimodal models' capabilities in counterfactual reasoning, devising assistive AI for remediation and/or cause estimation of observed failures, and more sophisticated visual event understanding and reasoning.

Limitations
We hereby discuss the potential limitations of our work: (1) Our work focuses on three commonsense dimensions: physical, social, and temporal. While they likely span the most common types of reasoning, there could be more; e.g., numerical commonsense is not specifically dealt with in this work, nor are uncommon activities such as fantasies and fictions involved. Future models benchmarked against our dataset should take this into account: even if a model excels at these commonsense dimensions for counterfactual reasoning, we cannot guarantee that it is a complete model across all types of reasoning.
(2) The videos used in this work are subsets of those readily collected in the Oops! (Epstein et al., 2020) and Ego4D (Grauman et al., 2022) mother sets, and hence the event distribution is bounded by the activities they concern. While we argue that, to the best of our knowledge, our dataset is the first video QA dataset of its kind in terms of diversity and dedication to counterfactual reasoning, it can be further extended with video resources spanning even more diverse situations. We will release the manuscripts and our collection tools to help spur future relevant research in such endeavors.
(3) Unlike Oops!, Ego4D videos do not contain obvious failed actions, and hence the annotated questions could be confounded by more imagined situations. We argue that the required reasoning technique is essentially the same and that models trained on our dataset should generalize well to situations that actually involve failing actions in egocentric visual contexts. However, we encourage future research to extend the first-person-viewpoint (egocentric) part to encompass obvious failing actions, in order to collect just-in-time assistive questions and their corresponding remedial responses.

Ethics and Broader Impacts
We hereby acknowledge that all co-authors of this work are aware of the provided ACL Code of Ethics and honor the code of conduct. This work is mainly about collecting a sizable video question answering dataset that focuses on counterfactual reasoning abilities and systematically probing this capability in state-of-the-art multimodal and large language models.

Dataset. We collect the human annotations of question-answer pairs for each prompted video via Amazon Mechanical Turk (MTurk) and ensure that all personal information of the workers involved (e.g., usernames, emails, URLs, demographic information, etc.) is discarded from our dataset. Although each video is annotated by only a single worker, we ensure that similar types of events are annotated by as diverse a worker pool as possible during the collection phase. We do not foresee many unintended biases in the annotations, as they should focus on the contents of the videos for the physical and temporal domains, and we make efforts to reduce potential biases in the social domain by providing periodic email-based training to the workers to diversify their creations.
This research has been reviewed by the IRB board and granted IRB-exempt status. The detailed annotation process (pay per amount of work, guidelines) is included in the appendix; overall, we ensure our pay per task is above the annotators' local minimum wage (approximately $15 USD/hour). We primarily consider English-speaking regions for our annotations, as the task requires a certain level of English proficiency.
Techniques. We benchmark the proposed video counterfactual reasoning task with state-of-the-art large-scale pretrained language models (both language-only and multimodal). As counterfactual commonsense reasoning and its understanding are our main focus, we do not anticipate the production of harmful outputs, especially towards vulnerable populations, after training (and evaluating) models on our proposed task.

A Details of The Dataset
Our dataset consists of a mixture of QA pairs collected from two data sources: Ego4D and Oops!. For each dataset split, we create an indexing .json file and summarize each QA instance with a video id (index), a domain (physical/social/temporal), a type (counterfactual), a question, a correct answer, a distractor, a key to the correct answer, and a video link URL. Our official data release will encompass all the aforementioned essential fields.
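For illustration, a single entry in the index file takes roughly the following shape; the concrete values and exact key names below are placeholders, not actual released data.

```python
# Illustrative shape of one QA instance in the index .json file
# (key names and values are placeholders, not actual dataset content).
example_instance = {
    "video_id": "ego4d_000042",
    "domain": "temporal",          # one of: physical / social / temporal
    "type": "counterfactual",
    "question": "<what-if question about the video>",
    "correct_answer": "<plausible outcome consistent with the video>",
    "distractor": "<minimally perturbed, incorrect outcome>",
    "answer_key": "correct_answer",
    "video_url": "<link to the source video clip>",
}
```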

A.1 Dataset Splits
We split our data into train/val/test based on the ratio 0.45/0.05/0.5, with each unique video only appearing in one split.

A.2 Word Distributions
Figure 4a and Figure 4b plot the distributions of the most frequent verbs (mainly events) and nouns (mainly entities) in the Oops! portion of our dataset, while Figure 5a and Figure 5b plot those of the Ego4D portion.
Figure 6a and Figure 6b show the corresponding distributions for the CLEVRER dataset, and Figure 7a and Figure 7b show those for the TrafficQA dataset. It can be seen from these charts, alongside Table 7 and Table 8, that the event types in both datasets skew heavily towards their originally intended domains (which is reasonable), with all four ratios much lower than those of our dataset.

We use an internal validation interface with a question-answering setting to accept or reject a sample. This tool also allows us to fix wrong domain categorizations and T/F labels annotated by the workers. Specifically, the validation questions include:

• Should we discard this question group from our dataset (repetitive / not fixable at all)?
• Does this question group need any editing to reduce ambiguity or to further fool the model?
• Check the T/F of the two sentences.
• Select the domain that you think this question group can be categorized into.
• Select one of the types that you think this question group can be categorized into.
• To answer this question, do you need to refer to the video?
• Does this question group conform to our question format?
Figure 4: (a) Top-40 frequent verbs in the Oops! part of ACQUIRED. (b) Top-40 frequent nouns in the Oops! part of ACQUIRED.

Table 1: Comparisons of different visual question answering datasets. ACQUIRED is the first to feature all the dimensions.

Dataset | Video Domain | Annotation | Dimensions covered
CLEVRER (Yi et al., 2020) | Synthetic Object Collision | Automatic | ✓ ✓ ✗ ✓
VLEP (Lei et al., 2020) | TV & YouTube | Human | ✓ ✗ ✗ ✗
MovieQA (Tapaswi et al., 2016) | Movie | Human | ✓ ✓ ✓ ✗
MSRVTT-QA (Xu et al., 2017) | Diverse Real-world Event | Automatic | ✓ ✗ ✗ ✗
TGIF-QA (Jang et al., 2017) | Tumblr GIF | Automatic & Human | ✓ ✓ ✗ ✗
MarioQA (Mun et al., 2017) | Gameplay Video | Automatic | ✓ ✓ ✗ ✗
TVQA (Lei et al., 2018) | TV | Human | ✓ ✓ ✗ ✗
Social-IQ (Zadeh et al., 2019) | YouTube | Human | ✗ ✗ ✓ ✗
TrafficQA (Xu et al., 2021) | Traffic Event | Human | ✓ ✓ ✗ ✓
NExT-QA (Xiao et al., 2021) | Diverse Real-world Event | Human | ✓ ✓ ✗ ✗
Causal-VidQA (Li et al., 2022) | Diverse | |

Table 2 :
Sample data points of the ACQUIRED dataset.

Table 3: Annotation drop rate for the first 5 batches (columns: Batches, Annotation Drop-Rate (%), Number of Videos). Each video gives 3 pairs of questions with correct/distractor answers.

Table 4 :
General statistics of the two video domains.

Table 5 :
Deployed model fooling rates during collection.

Table 6 :
Model benchmarking performance on our ACQUIRED dataset.