ViLPAct: A Benchmark for Compositional Generalization on Multimodal Human Activities

We introduce ViLPAct, a novel vision-language benchmark for human activity planning. It is designed for a task where embodied AI agents can reason and forecast future actions of humans based on video clips of their initial activities and intents given in text. The dataset consists of 2.9k videos from Charades extended with intents via crowdsourcing, a multi-choice question test set, and four strong baselines. One of the baselines implements a neurosymbolic approach based on a multi-modal knowledge base (MKB), while the other ones are deep generative models adapted from recent state-of-the-art (SOTA) methods. According to our extensive experiments, the key challenges are compositional generalization and effective use of information from both modalities.

One of the ultimate goals of Artificial Intelligence is to build intelligent agents capable of accurately understanding humans' actions and intents, so that they can better serve us (Kong and Fu, 2018). Newly emerging applications in robotics and multi-modal planning, such as Amazon Astro, have demonstrated a strong need to understand human behavior in multimodal environments. On the one hand, such an agent, e.g. an elderly care service bot, needs to understand human activities and anticipate human behaviors based on users' intents. Here the intents may be estimated based on previous activities or articulated verbally by users. The anticipated behaviors may be used for risk assessment (e.g. falling of elderly people) and to facilitate collaboration with humans. On the other hand, recent advances in robotics show that it is possible to let robots learn new tasks directly from observed human behavior without robot demonstrations (Yu et al., 2018; Sharma et al., 2019). However, that line of work focuses on imitating observed human actions without anticipating future activities.
To promote research on action forecasting based on intents, we propose the vision-language planning task for human behaviors. As shown in Fig. 1, given an intent in textual form and a short video clip, an agent anticipates which actions a human is likely to take. We consider intents as given because there is already ample research on intent identification (Pandey and Aghav, 2020) and automatic speech recognition (Malik et al., 2021). To the best of our knowledge, there is no dataset to evaluate models for this task.
The task poses two major challenges. First, there are often multiple plausible action sequences satisfying an intent. Second, it is highly unlikely that a training dataset can cover all possible combinations of actions for a given intent. Hence, models need to acquire compositional generalization (Fodor and Pylyshyn, 1988), the capability to generalize to unseen action sequences composed of known actions.
In this work, we construct a dataset called ViLPAct for Vision-Language Planning of human Activities, which to the best of our knowledge is the first dataset studying the above challenges. Specifically, we extend the Charades dataset (Sigurdsson et al., 2016) with intents via crowdsourcing. As it is practically infeasible to find all possible future action sequences given an intent and a video clip of initial activities, we propose to evaluate all systems by letting each of them answer multi-choice comprehension questions (MQA) without training them on those questions. Given an intent and a video clip showing initial activities, each multi-choice question provides a fixed number of future action sequences as possible answers. A system is then asked to select the most plausible action sequence among them. We show that the rankings of all models using the MQAs correlate strongly with those obtained by asking human assessors to directly observe estimated action sequences. For training, we provide both a dataset for end-to-end training of sequence forecasting and a multimodal knowledge base (MKB) built from that dataset, which is also the first video-based multimodal knowledge base for human activities to the best of our knowledge.
We conduct the first empirical study to investigate compositional generalization for the target task. As baselines, we adapt three strong end-to-end deep generative models for this task and propose a neurosymbolic planning baseline using the MKB. The model is neurosymbolic because it combines both deep neural networks and symbolic reasoning (Garcez and Lamb, 2020). Given a video of initial activities and an intent, the deep models generate the top-k relevant action sequences, while the neurosymbolic planning model sends the intent and the action sequence recognized from the video as the query to the MKB, followed by retrieving the top-k relevant action sequences. Each model selects the most plausible answers by performing probabilistic reasoning over the relevant action sequences. We conduct extensive experiments and obtain the following key experimental results:
• We compare the evaluation results using MQA with the ones of human evaluation. The results of both methods are well aligned. Thus, MQA is reliable without requiring human effort.
• The likelihood functions of the deep generative models are not able to reliably infer which answers are plausible. In contrast, probabilistic reasoning is an effective method to improve compositional generalization.
• Despite information from both modalities being useful and complementary, all baselines heavily rely on intents in textual form but fail to effectively exploit visual information from video clips.

Related Work
Vision-Language Planning Task. Vision Language Navigation (VLN) was among the first widely used goal-oriented vision-language tasks, requiring AI agents to navigate in an environment without interaction by reasoning on the given instruction (Anderson et al., 2018; Hermann et al., 2020; Misra et al., 2018; Jain et al., 2019). Recently, further goal-oriented vision-language tasks have been proposed. The Vision and Dialogue History Navigation (VDHN) task (De Vries et al., 2018; Nguyen and Daumé III, 2019; Thomason et al., 2020), which is similar to VLN, requires agents to reason on the instructions over multiple time steps. Other tasks such as Embodied Question Answering (EQA; Das et al. 2018; Wijmans et al. 2019), Embodied Object Referral (EOR; Qi et al. 2020b; Chen et al. 2019) and Embodied Goal-directed Manipulation (EGM; Shridhar et al. 2020; Kim et al. 2020; Suhr et al. 2019) rely on reasoning and interpreting the instruction with observation or object interaction in the environment. However, we argue that there are other ways to learn to plan without practising. Our task is one example of this, requiring agents to reason over the observation without performing actions.
Vision-Language Planning Datasets. As existing vision-language planning datasets emphasize teaching embodied AI to perform the task like humans, they are constructed with interactive AI in mind. VLN (Anderson et al., 2018) datasets initially started exploring planning tasks with the textual instruction as a step-by-step abstract guide and minimal interaction with the environment. Extending the VLN task, VDHN (De Vries et al., 2018) datasets provide an interactive textual dialogue between the speaker and the receiver in multiple steps. The EQA (Das et al., 2018) task takes this a step further by providing data in an object-centric QA manner, advancing systems to understand the given environment through object retrieval. The EOR (Qi et al., 2020b) task designs object-centric datasets with detailed instructions, aiming at localizing the relevant objects accurately. The closest benchmark to ours is ALFRED (Shridhar et al., 2021) from the EGM task, which lets embodied agents decide on actions and objects to be manipulated based on detailed instructions. However, in our setting, we ask intelligent systems to predict the most reasonable future action sequence based on human intents and answers in a Multiple Choice Question Answering (MQA) format. During prediction, we still give systems the flexibility to consider various combinations of actions and objects.
Vision-Language Planning Modeling. According to Francis et al. (2021), several approaches have been used for planning. Greedy search in end-to-end models has been reported in several studies to work well in goal-oriented tasks (Fried et al., 2018; Das et al., 2018; Shridhar et al., 2020; Anderson et al., 2018). Task progress monitoring (Ma et al., 2019) is another method to tackle planning. It allows models to backtrack on actions if the current action is found to be suboptimal. Mapping (Anderson et al., 2019) has also been proposed for efficient planning via sensors. Topological and Exploration planning (Deng et al., 2020; Ke et al., 2019) enables modeling the planning in a symbolic manner. When goals are provided as several sub-goals, a divide and conquer strategy (Misra et al., 2018; Shridhar et al., 2020; Suhr et al., 2019) may be invoked to perform sub-task planning. In our work, we highlight another potential approach, knowledge base retrieval. As we construct an MKB containing various action sequences with detailed features, intelligent agents can retrieve the most suitable sequence from the MKB in order to perform the planning.

Dataset Construction
We adopt videos from Charades (Sigurdsson et al., 2016) and solicit intents for videos via crowdsourcing. We consider videos that have action sequences of sufficient length appearing in both initial video clips and answers, which results in a dataset comprising 2,912 videos. The dataset is split into training/validation/test sets with a ratio of 70%, 10%, 20%. On the training dataset, we build an MKB by incorporating structural and conceptual information. On the test dataset, we collect a set of MQAs for model evaluation. The evaluation with MQAs is in fact an adversarial testing method, widely used for quality estimation in machine translation (Kanojia et al., 2021). Herein, the ability of a model to discriminate between correct outputs and meaning-changing perturbations is predictive of its overall performance, not just its robustness. Thus MQAs are applied only for testing.

Data Normalization and Filtering
Charades is a large-scale video dataset of daily indoor activities collected via Amazon Mechanical Turk (AMT). The average length of videos is approximately 30 seconds. It involves interactions with 46 object classes and contains 157 action classes, which are also referred to as actions for short. Each action is represented as a verb phrase, such as "pouring into a cup". This dataset is chosen because i) it contains a sufficient number of long action sequences of human daily activities; ii) the intents are easily identifiable, as the activities in the videos are based on scripts; iii) there are rich annotations of videos that can be leveraged for dataset construction. The details of action sequence selection in videos are presented in Appendix 7.1, with the goal of choosing core action sequences having clear human goals.
In order to assess the quality of extracted action sequences, we randomly sample 100 videos from the test set for manual inspection. The primary action sequence of each video is evaluated in terms of three criteria: i) if all actions of a sequence occur in the video; ii) if the actions of a sequence appear in the same order as in the video; iii) if a sequence has any actions missing between the first and the last action. In total, we determined that 94 videos have all actions of their action sequences covered in the video. The actions of 92 videos appear in the same order as in the videos. Furthermore, 85 videos have no actions missing between the first and the last action of their sequences. Thus, the quality of such action sequences is adequate for VL planning evaluation.
Following prior work (Ng and Fernando, 2020), we consider the first 20% of a video as its initial visual state and aim to forecast future actions appearing in the remaining part of the video for a given intent. To have at least one future action per video, we retain only videos that contain at least one action sequence comprising more than three actions. As a result, we obtain 2,912 such videos, each of which is associated with one action sequence of length longer than three.
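The filtering and the 20% split described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the `Action` record, the function names, and the timestamps are hypothetical, and the rule that an action counts as observed when it starts before the end of the initial state is taken from the evaluation setup of this paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Action:
    """Hypothetical record for one annotated action with its time span."""
    name: str
    start: float  # seconds
    end: float    # seconds


def keep_video(actions: List[Action], min_len: int = 4) -> bool:
    """Keep a video only if its sequence has more than three actions."""
    return len(actions) >= min_len


def split_actions(actions: List[Action], video_length: float,
                  observed_ratio: float = 0.2) -> Tuple[List[Action], List[Action]]:
    """Split a sequence into observed and future actions at the 20% mark.

    An action is treated as observed if it starts before the end of the
    initial visual state; all remaining actions are future actions.
    """
    cutoff = observed_ratio * video_length
    ordered = sorted(actions, key=lambda a: a.start)
    observed = [a for a in ordered if a.start < cutoff]
    future = [a for a in ordered if a.start >= cutoff]
    return observed, future


# Illustrative 30-second video with made-up timestamps.
demo = [
    Action("opening a refrigerator", 0.0, 4.0),
    Action("taking food", 3.0, 9.0),
    Action("closing a refrigerator", 10.0, 12.0),
    Action("eating", 14.0, 28.0),
]
demo_observed, demo_future = split_actions(demo, video_length=30.0)
```

With a 30-second video the cutoff falls at 6 seconds, so the first two actions are observed and the last two have to be forecast.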

Intent Annotation
An intent may be defined as "something that you want and plan to do". Philosophers distinguish between future-directed intents and present-directed ones (Cohen and Levesque, 1990). The former guide the planning of actions, while the latter causally produce behavior. As the focus of this work is anticipating and planning actions, we encourage crowd-workers to also provide future-directed intents.
We recruit crowd-workers to annotate videos with future-directed and present-directed intents. Each annotator is provided with a full video clip and the associated action sequence. They are instructed to answer the question of what the person wants to do by taking the actions in the video. Every annotator is asked to submit two intents. One of them should describe which activity the person intends to take, such as "drink a glass of water". The other one needs to be at a high level, such as "quench the thirst" or "be thirsty". The permitted formats are either "S/He wants to + do_something" or "S/He is + feeling". Thus, the annotators are encouraged to provide future-directed intents by differentiating them from ones causally leading to behaviours. To ensure the quality of intent annotations, we randomly assign three crowd-workers to write intents per video. The process of constructing the dataset for intent annotation involved a rigorous validation and selection process. One of the authors acted as an expert annotator, and conducted a thorough review of all crowd-sourced intents to identify and select the most reasonable annotations as the final results. The validation process was completed in three rounds, yielding increasingly higher percentages of reasonable annotations, with 82%, 94% and 100% respectively for each round. The annotations that did not meet the required criteria were discarded and not included in the final dataset. This rigorous validation process ensured that the final dataset is comprised of high-quality and relevant annotations, providing a robust foundation for subsequent modeling and analysis.

Multimodal Knowledge Base
We construct the MKB of human activities based on the training set and validation set by taking a neurosymbolic approach. The main challenges herein are twofold: i) how to represent multimodal information from videos, action names, and intents adequately to facilitate information retrieval; ii) how to model shared knowledge of multimodal information. For the former, we allow both string and embedding based retrieval methods by attaching neural representations of video clips and texts to symbols of actions and action sequences. For the latter, we employ the classical planning language STRIPS (Bylander, 1994) and neural prototypes to encode abstract properties of actions.
At the core of the MKB is a knowledge graph G = (V, E), where the node set V comprises four types of nodes: action classes, action video clips, action sequences, and action sequence videos, while the edge set E contains edges reflecting relationships between nodes.
An action class $a_c$ is the abstraction of an action described in the language of STRIPS. The attributes of an action class include its ID, its name $\tau$, its precondition set PRE, its add effect set ADD, and its delete effect set DEL. An action is executed only if its preconditions are satisfied. The effect sets ADD and DEL of an action class describe the add and delete operations applied to the current state after executing the action. For example, the precondition of Closing a refrigerator is isOpen(refrigerator), ADD = isClosed(refrigerator) and DEL = isOpen(refrigerator). In this way, the properties described in STRIPS present the shared knowledge of each action class. An action sequence comprises a future-directed intent, a present-directed intent, and a sequence of action IDs. An intent is represented by both a word sequence and the distributed representation of the word sequence. We obtain the distributed representation of an intent by applying BERT (Devlin et al., 2018) and utilizing the representation of the CLS token. The collection of action sequences can be easily turned into a training set for end-to-end models by associating them with the corresponding video files.
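To make the STRIPS attributes concrete, the following is a minimal sketch of an action class with PRE, ADD and DEL sets and the usual STRIPS state update. The class name `ActionClass`, the ID `"a1"`, and the helper `is_executable` are hypothetical and only serve illustration; the MKB's actual schema may differ.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class ActionClass:
    """Hypothetical STRIPS-style action class; states are sets of ground facts."""
    action_id: str
    name: str
    pre: FrozenSet[str] = frozenset()
    add: FrozenSet[str] = frozenset()
    delete: FrozenSet[str] = frozenset()

    def applicable(self, state: FrozenSet[str]) -> bool:
        # An action can be executed only if all preconditions hold.
        return self.pre <= state

    def apply(self, state: FrozenSet[str]) -> FrozenSet[str]:
        # Standard STRIPS update: remove DEL facts, then add ADD facts.
        if not self.applicable(state):
            raise ValueError(f"preconditions of {self.name!r} are not satisfied")
        return (state - self.delete) | self.add


def is_executable(sequence: Iterable[ActionClass], state: FrozenSet[str]) -> bool:
    """Check whether a whole action sequence can be executed from a state."""
    for action in sequence:
        if not action.applicable(state):
            return False
        state = action.apply(state)
    return True


# The refrigerator example from the text; the ID "a1" is made up.
close_fridge = ActionClass(
    action_id="a1",
    name="Closing a refrigerator",
    pre=frozenset({"isOpen(refrigerator)"}),
    add=frozenset({"isClosed(refrigerator)"}),
    delete=frozenset({"isOpen(refrigerator)"}),
)
start_state = frozenset({"isOpen(refrigerator)"})
end_state = close_fridge.apply(start_state)
```

Closing the refrigerator twice in a row is rejected by `is_executable`, because the first execution deletes the fact that the second one requires.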
The MKB includes two types of visual nodes: action sequence videos and action video clips. Each action sequence video is linked to the corresponding action sequence. For each action in an action sequence, we associate it with the corresponding video clip, as illustrated in Fig. 2. For each action video clip, we apply I3D to encode it into a sequence of frame-level visual feature vectors $\{f_{s_1}, f_{s_2}, \ldots, f_{s_t}\}$, where each vector $f_{s_i} \in \mathbb{R}^{1024}$ corresponds to the features of an 8-frame snippet.
To represent an action sequence video, we apply average pooling to the distributed representations of all involved video clips.
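As a rough illustration of this pooling step, the sketch below averages snippet-level feature vectors into a clip vector and then averages clip vectors into one vector per action sequence video. The two-level pooling and the function names are assumptions; the text only states that average pooling is applied to the clip representations. Toy 2-dimensional vectors stand in for the 1024-dimensional I3D features.

```python
from typing import List, Sequence


def average_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized feature vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def sequence_video_representation(
        clips: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Pool snippet features per clip (assumption), then pool across clips."""
    clip_vectors = [average_pool(snippets) for snippets in clips]
    return average_pool(clip_vectors)


# Two clips: the first with two snippets, the second with one.
toy_clips = [
    [[1.0, 3.0], [3.0, 5.0]],
    [[4.0, 0.0]],
]
toy_video_vector = sequence_video_representation(toy_clips)
```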
Relations. We consider two types of relations in the MKB. The first type of relation links an action sequence to the corresponding visual representation. The other type of relation associates an action in an action sequence with the corresponding action class. Therefore, it is easy to perform symbolic reasoning by using the STRIPS properties of each action class involved in an action sequence.

Multi-Choice Comprehension Questions for Evaluation
Given the first 20% of a video as the initial state $s$ and a future-directed intent $g$ in text, the planning evaluation task involves choosing the most plausible future action sequence $a^f$ among six available choices. We determine the initial action sequence $a^i$ by checking if an action of a sequence starts before the end time of the initial state. To build such a dataset, we extended the test set with adversarially generated incorrect answers. As the automatic approach may generate reasonable action sequences, we recruit another group of students to manually check all answers and determine the most plausible ones as the correct answers on AMT. Figure 3 shows an example of our planning task. The key idea here is to replace an action of an observed action sequence with an alternative action that is relevant to the preceding actions and is not overly similar to the action to be replaced. As many videos in the test set have only a single future action, the AM algorithm is extended to optionally insert a future action to generate an answer candidate. More specifically, given the initial state, the action sequence, and the intent $(s, a, g)$ of a video, where $a = (a^i, a^f)$, the algorithm starts by randomly deciding if it applies substitution or insertion to generate an answer candidate. If insertion is chosen, it inserts an action randomly selected among the 157 candidate actions, at a position that is randomly picked after the last action in $a^i$. If instead substitution is chosen, we feed the initial action sequence $a^i$ to BERT and use the representation of the CLS token as the representation of $a^i$. Then we apply BERT to turn each action into a vector by using the corresponding CLS representation. We randomly pick a future action $a_i$ in $a^f$ and compute the score of a candidate action $a_j$ as $s(a_j) = \log(P_{\text{sim}}(a_i, a_j))$

where $P_{\text{sim}}(\cdot)$ is defined as cosine similarity. We set $\lambda = 0.7$ to find an optimal tradeoff between the obfuscation level of an incorrect answer and the probability of being a reasonable answer. We repeat this process until we have generated five answer candidates. For each set of generated answer
Quality Check via Crowd-Sourcing. We hired three crowd-workers per video on AMT to ascertain the quality of all auto-generated answers. For each video, a worker is presented with the first 20% of the video and the future-directed intents, which are paired with six answer candidates each (an original action sequence and five generated ones), because there were two annotators working on each video. They were instructed to choose the most reasonable pair of intent and action sequence among all possible combinations.
After checking the answers of all questions in the test set, we apply a set of heuristic rules to determine the final answer to each question. We calculate inter-annotator agreement by asking the group of workers that did the annotation to work on a sample of multi-choice questions of the MQA task.
To evaluate the quality of the MQA choices, we determined the number of agreements between the ground truth (the correct answers) and the predicted answers. Then, we computed the number of agreements that would be expected by chance based on the distribution of answers. The corresponding Cohen's kappa coefficient (Kraemer, 2014) is 0.91, which demonstrates the high quality of the MQA choices.
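The agreement computation described above follows the standard definition of Cohen's kappa, $\kappa = (p_o - p_e) / (1 - p_e)$, with observed agreement $p_o$ and chance agreement $p_e$. A minimal sketch is given below; the function name and the toy labels are illustrative and are not the study's data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa between two label sequences of equal length."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement from the two marginal label distributions.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example: chosen answer index per question (made-up values).
ground_truth = [0, 1, 2, 0, 1, 2]
chosen = [0, 1, 2, 0, 1, 1]
toy_kappa = cohens_kappa(ground_truth, chosen)
```

In the toy example the observed agreement is 5/6 and the chance agreement is 1/3, which gives a kappa of 0.75.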
Table 2 shows the basic statistics of the test set. The average number of observed actions in $s$ is similar to the average number of future actions. Although all actions in the test set appear in the training set, the most plausible action sequences of almost 400 videos are unseen in the training set. For intents in MQA, we also calculate the number of distinct future action sequences for each of them, and the standard deviation across all of them. The results indicate how diverse potential future action sequences can be for a single intent. Other details of MQA can be found in Appendix 7.2.

Baselines
VL planning of human activities requires predicting future action sequences given an initial visual state video and an intent provided in textual form. The task poses two major challenges. First, the information provided in the two modalities is complementary, while the majority of multimodal research focuses on the shared information by exploring fusion techniques (Guo et al., 2019). Second, the output space is exponentially large with respect to the action space. It is not realistic to assume that all action sequences are already observed in the training data. Hence, any model tackling this task is expected to address systematic composition (Fodor and Pylyshyn, 1988) of human activities, the capacity to understand and produce a huge number of novel combinations of known actions. In contrast, state-of-the-art deep learning methods often perform poorly on compositional generalization (Lake, 2019; Keysers et al., 2019).
We compare deep generative models and a neurosymbolic planning model in the framework of retrieval and reasoning. Given the first 20% of a video and a future-directed intent, the first step is to obtain top-k relevant action sequences, followed by performing reasoning over the top-k action sequences to find the most plausible answers. Both types of models share the same reasoning module but differ in how they obtain top-k action sequences. For reproducibility, the details of all models are provided in Appendix 7.3 and 7.4.

Deep Generative Models
The deep generative models apply beam search to produce the top-k most likely future action sequences, followed by performing reasoning.
ACT-UNIVL. We adapt UNIVL (Luo et al., 2020), a SOTA unified pretrained vision-language model for multimodal understanding and generation, for the target task (denoted as ACT-UNIVL). We consider ACT-UNIVL because it performs the best on the tasks that are closest to our target task, such as YouCook2 (Zhou et al., 2017). The pre-trained ACT-UNIVL takes as input an intent and an initial video clip, and is fine-tuned to forecast future action sequences.
Two Stage Planning Model. The two stage planning baseline, TwoStagePlan for short, starts by converting an initial video clip into an action sequence in text by using ACT-UNIVL, followed by applying a pre-trained language model, ProphetNet (Qi et al., 2020a) (denoted as ACT-PROPHETNET for ViLPAct), to predict future actions.
ACT-PROPHETNET To study the impact of visual information, we consider a text-only baseline by employing ACT-PROPHETNET to predict future action sequences only based on intents.

Neurosymbolic Planning Model
Given an intent and an initial visual state, the neurosymbolic planning model (NSPlan) retrieves top-k relevant action sequences from the MKB in two stages, and then utilizes the retrieved results to infer the most plausible answers.
In the first stage, we apply the pretrained ACT-UNIVL to convert a video clip into an action sequence and send it as a query to the MKB to retrieve top-50 results. For each retrieved result, the ranking score is the weighted sum of the BM25 (Robertson and Walker, 1994) score between two action sequences and the cosine similarity between the intents.
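The first-stage ranking score can be sketched as below. This is an illustration under several assumptions: the BM25 variant (with the common `log(1 + ...)` inverse document frequency), the parameters `k1` and `b`, the mixing weight `alpha`, and the entry layout with `"actions"` and `"intent_vec"` keys are all hypothetical, since the text does not specify them. Actions are treated as BM25 terms, and toy 2-dimensional vectors stand in for BERT intent embeddings.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def bm25_scores(query: Sequence[str], corpus: Sequence[Sequence[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """BM25 score of every corpus entry for the query (actions act as terms)."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def first_stage_rank(query_actions: Sequence[str], query_intent: Sequence[float],
                     entries: Sequence[Dict], alpha: float = 0.5,
                     top_n: int = 50) -> List[int]:
    """Indices of the top_n entries by a weighted sum of BM25 and intent cosine."""
    bm25 = bm25_scores(query_actions, [e["actions"] for e in entries])
    scored = [
        (alpha * s + (1 - alpha) * cosine(query_intent, e["intent_vec"]), idx)
        for idx, (s, e) in enumerate(zip(bm25, entries))
    ]
    scored.sort(key=lambda pair: -pair[0])
    return [idx for _, idx in scored[:top_n]]


# Toy knowledge base entries with made-up actions and intent vectors.
toy_entries = [
    {"actions": ["open_fridge", "take_food", "eat"], "intent_vec": [1.0, 0.0]},
    {"actions": ["open_book", "read_book"], "intent_vec": [0.0, 1.0]},
    {"actions": ["open_fridge", "close_fridge"], "intent_vec": [0.6, 0.8]},
]
toy_ranking = first_stage_rank(["open_fridge", "take_food"], [1.0, 0.0], toy_entries)
```

Since raw BM25 scores and cosine similarities live on different scales, a real system would likely tune the weight on validation data or normalize the two scores first.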
In the second stage, it re-ranks the initial retrieval results by using both visual and symbolic knowledge. Each retrieved action sequence is represented as a sequence of frame-level visual feature vectors, extracted by the visual encoder I3D. An Ordered Temporal Alignment Module (OTAM; Cao et al. 2020) is applied to compare two visual feature sequences. In order to rank the sequences with potential future actions higher, we use a rule-based score function to prefer longer sequences containing unseen actions. In the end, we keep only the top-k results for probabilistic reasoning.

Probabilistic Reasoning for MQA
We propose a novel approach for MQA called ProbInf, which performs probabilistic inference over the top-K retrieved action sequences to identify the most likely answer for a question. From each retrieved result after re-ranking, obtained from NSPlan, we remove the predicted observed action sequence $s^{a}_{q}$ to obtain potential future action sequences. For generative models, we directly use the generated outcomes. For each answer candidate $c_i$ of a question, we compute $p(c_i \mid s, g)$ by integrating over all retrieved results $\{r_1, r_2, \ldots, r_K\}$, given the initial visual state $s$ and intent $g$:
$$p(c_i \mid s, g) = \sum_{j=1}^{K} p(c_i \mid r_j)\, p(r_j \mid s, g),$$
where $p(r_j \mid s, g)$ is the normalized ranking score for a result $r_j$ and $p(c_i \mid r_j)$ is the normalized similarity between an answer candidate and each retrieved result. As both answers and retrieved results are action sequences represented in text, we employ the time series metric Time-warped edit distance (TWED; Marteau 2009) to compute their similarity $\phi(f(c_i), f(r_j))$, where $f(c_i)$ denotes the visual prototype representation of an action sequence and $d_{\text{twed}}(f(c_i), f(r_j))$ denotes the distance computed by the TWED algorithm. The normalized similarity $p(c_i \mid r_j)$ is then obtained by normalizing $\phi$ over the $n$ possible answers of a question. The most plausible answer is the one with the maximal $p(c_i \mid s, g)$ over all answer candidates.
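The marginalization over retrieved results can be sketched in a few lines. Note the substitutions made for illustration: a plain Levenshtein distance over action labels stands in for TWED over visual prototype representations, the similarity is assumed to be `exp(-distance)`, and both distributions are normalized by simple division. None of these choices is confirmed by the text; only the overall structure (a weighted sum of per-result answer distributions) follows the description above.

```python
import math
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two action sequences (stand-in for TWED)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def normalize(values: Sequence[float]) -> List[float]:
    """Scale non-negative values to sum to one (uniform if they sum to zero)."""
    total = sum(values)
    if total <= 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def prob_inf(candidates: Sequence[Sequence[str]],
             retrieved: Sequence[Sequence[str]],
             ranking_scores: Sequence[float]) -> List[float]:
    """p(c_i | s, g) = sum_j p(c_i | r_j) * p(r_j | s, g) for every candidate."""
    p_result = normalize(ranking_scores)  # p(r_j | s, g)
    posterior = [0.0] * len(candidates)
    for result, weight in zip(retrieved, p_result):
        # Assumed similarity: exp(-distance), normalized over the candidates.
        sims = [math.exp(-edit_distance(c, result)) for c in candidates]
        for i, p in enumerate(normalize(sims)):  # p(c_i | r_j)
            posterior[i] += p * weight
    return posterior


# Toy question with two answer candidates and two retrieved sequences.
toy_candidates = [["eat", "wash"], ["read", "sleep"]]
toy_retrieved = [["eat", "wash"], ["eat", "drink"]]
toy_posterior = prob_inf(toy_candidates, toy_retrieved, ranking_scores=[2.0, 1.0])
toy_answer = max(range(len(toy_candidates)), key=lambda i: toy_posterior[i])
```

Because every retrieved result contributes a full distribution over the candidates, a candidate that never appears verbatim among the retrieved results can still receive substantial probability mass from several partially matching results.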

Experiments
We conduct extensive experiments to answer the following three main research questions. The other research questions are addressed in Appendix 7.9.
RQ1: How reliable is the MQA evaluation method? We show that the evaluation results using MQA are consistent with those by asking humans to directly observe model outputs. For this, we recruit five crowd-workers to rank all models in comparison on each of the 100 questions randomly sampled from the test set, and compare them with the corresponding results using MQA. Specifically, for each question, a crowd-worker is asked to rank the top-k outputs of the four baselines in terms of how well they match the intent and the remaining 80% of the original videos. As a result, Figure 4 shows how frequently each model is ranked at position X by the crowd-workers w.r.t. the top-10 predictions (left) and top-1 predictions (right), respectively. In both cases, we consistently find that the best model is ACT-UNIVL, followed by ACT-PROPHETNET, NSPlan, and TwoStagePlan. The ranking result is the same as using MQA on the same set of questions. The ranking differences on individual questions between the human evaluation and MQA are statistically insignificant according to Wilcoxon's signed-rank test (Woolson, 2007), details of which can be found in Appendix 7.8.
RQ2: What are the key challenges? We identify two major challenges of the target task.
Compositional Generalization Using Reasoning. It is common practice to rank each answer by the likelihood yielded by a generative model (Holtzman et al., 2021). However, Table 3, which provides the overall evaluation results using MQA, shows that the generative baselines perform poorly when they rank answers based on the likelihood. In contrast, ProbInf effectively uses top-k results to boost the performance of all generative models by more than 44%. For the respective performance on seen and unseen action sequences (Table 4), ProbInf delivers stable results across models. The performance on unseen combinations of seen actions measures exactly the ability of compositional generalization. This raises the question "Why does ProbInf help compositional generalization?" for future research. As there is still a sizable gap between seen and unseen action sequences, and all models fall short of the human performance (Table 3) by at least 23%, how could we make further improvements?
Effective Use of Both Modalities. To understand the utility of each modality, we compare the two strongest multimodal models by varying their inputs: including both modalities or just a single modality. As shown in Table 5, intents provide the strongest signal, while visual information is useful overall for both models. This also explains why ACT-PROPHETNET comes close to ACT-UNIVL.
To further investigate the significance of visual information for multimodal models, we replace the visual features of ACT-UNIVL with randomly selected ones during both training and inference, finding that ACT-UNIVL suffers from only a 4% drop of accuracy using MQA. Hence, the multimodal models capture only weak associations between visual features and future action sequences.
It is counter-intuitive that visual features do not play a significant role, because plans vary in accordance with different visual environments. We conjecture this is due to poor performance of action recognition. To verify this, we feed ground-truth actions observed in the first 20% of videos to both TwoStagePlan and NSPlan during training and inference. They reach an accuracy of 82.11% and 81.37% respectively, an improvement of more than 15%.

RQ3: To what degree can the top-k results reflect the performance differences of systems? The reasoning method ProbInf leverages the top-k results produced by the models, hence it is useful to inspect those results for further insights. Therefore, we compare the top 10 results of each model in terms of precision and recall by treating each action sequence as a set (Ng and Fernando, 2020), as well as seq-hits@5 for measuring exactly matched action sequences. Moreover, to investigate the diversity of the top-k lists, we consider Dist1 and Dist2 (Li et al., 2016), which respectively measure the number of unique actions and consecutive action pairs in the top-k lists. The definitions of a complete list of used metrics and their results are provided in Appendix 7.6 and 7.4.1.
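The metrics named above can be sketched as follows. The exact definitions are given in the appendix and are not reproduced here, so the details are assumptions: precision and recall compare action sets, seq-hits@k is taken to be 1 if any of the first k predictions matches the reference sequence exactly, and Dist-n is returned as a raw count of unique n-grams, following the wording in the text.

```python
from typing import List, Sequence, Tuple


def set_precision_recall(predicted: Sequence[str],
                         reference: Sequence[str]) -> Tuple[float, float]:
    """Precision and recall when both action sequences are treated as sets."""
    pred, ref = set(predicted), set(reference)
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    return precision, recall


def seq_hits_at_k(top_k: Sequence[Sequence[str]], reference: Sequence[str],
                  k: int = 5) -> int:
    """1 if any of the first k predictions equals the reference exactly."""
    return int(any(list(seq) == list(reference) for seq in top_k[:k]))


def distinct_n(sequences: Sequence[Sequence[str]], n: int) -> int:
    """Number of unique n-grams of actions across a top-k list."""
    ngrams = set()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            ngrams.add(tuple(seq[i:i + n]))
    return len(ngrams)


# Toy top-k list with single-letter action labels.
toy_top_k: List[List[str]] = [["a", "b"], ["a", "c"], ["b", "c", "a"]]
toy_dist1 = distinct_n(toy_top_k, 1)
toy_dist2 = distinct_n(toy_top_k, 2)
toy_pr = set_precision_recall(["a", "b", "c"], ["a", "b", "d", "e"])
```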
According to Table 6, ACT-UNIVL outperforms all other models in terms of quality-oriented metrics but falls short of ACT-PROPHETNET in terms of both diversity metrics. However, none of the metrics yields the same ranking of models as the human evaluation. Although NSPlan achieves higher recall than ACT-PROPHETNET, its precision and seq-hits@5 are significantly lower than those of ACT-PROPHETNET, explaining why it performs worse than ACT-PROPHETNET using MQA.

Conclusion
We construct the novel benchmark ViLPAct to evaluate the ability of systems to anticipate and plan human actions in a multimodal vision-language setting, with a focus on evaluating their compositional generalization capabilities. In this benchmark, we extend Charades with intents, construct a test set with multi-choice questions, and include four strong baselines. Our empirical studies demonstrate that the task is easy for humans, but challenging for SOTA deep learning models due to the need for compositional generalization and an effective use of information from both modalities. The neurosymbolic planning baseline shows a promising research avenue for using symbolic and multimodal knowledge in an MKB.

Ethical Considerations
To mitigate the potential for exposure to problematic content in the Charades video dataset, we have implemented stringent safety measures to safeguard our annotators against adverse psychological effects. To ensure the suitability of the video content, the authors initially conducted a comprehensive review. However, we recognize that the annotation process may still expose annotators to potentially disturbing or offensive material. We therefore only engage annotators who are of legal age and clearly communicate that discretion is strongly advised during annotation. In the event that an annotator experiences discomfort or distress, we provide information on how they can seek support from the Substance Abuse and Mental Health Services Administration (SAMHSA), a free and confidential resource available 24/7. In addition, we have established a feedback mechanism that allows annotators to communicate their concerns in real time, and we respond to any feedback within 24 hours. Furthermore, we compensate our annotators with competitive wages, with an average hourly rate of approximately $12.

Action Sequence Extraction Algorithm
Each video of Charades is annotated with actions from at least one action sequence. The starting and ending points of an action are labelled, but it is not clear which actions jointly meet an intent. Therefore, we implement the greedy method in Algorithm 1 to automatically extract action sequences with clear intents from videos. For each video, the algorithm aims to identify a sequence of temporally and semantically coherent actions, which interact with the same or related objects. The scoring functions in Algorithm 1 measure coherence from three perspectives: i) semantic relevance based on TF-IDF-reweighted (Jones, 1972) Word2Vec embeddings (Mikolov et al., 2013), ii) temporal relevance, and iii) task relevance. Each action is manually assigned to one of 22 tasks; for example, "Opening a book" and "Closing a book" are assigned to the same task.
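The greedy grouping described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the candidate filter (actions starting no earlier than the previous action) is a guess, `toy_score` is a hypothetical stand-in that combines only temporal and task relevance (the semantic term based on TF-IDF-reweighted Word2Vec embeddings is omitted), and the task map and threshold are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Action:
    cls: str      # action class name
    start: float  # start time in seconds
    end: float    # end time in seconds


def extract_sequences(actions: List[Action],
                      score: Callable[[Action, Action], float],
                      threshold: float) -> List[List[Action]]:
    """Greedily chain each action to its best-scoring successor."""
    remaining = sorted(actions, key=lambda a: a.start)
    activities = []
    while remaining:
        prev = remaining.pop(0)  # earliest unassigned action seeds a sequence
        activity = [prev]
        while True:
            # Assumption: candidates start no earlier than the previous action.
            candidates = [a for a in remaining if a.start >= prev.start]
            if not candidates:
                break
            best = max(candidates, key=lambda a: score(prev, a))
            if score(prev, best) < threshold:
                break
            activity.append(best)
            remaining.remove(best)
            prev = best
        activities.append(activity)
    return activities


# Hypothetical task assignment, for illustration only.
TASK = {"Opening a book": "book", "Closing a book": "book",
        "Opening a refrigerator": "fridge"}


def toy_score(prev: Action, cand: Action) -> float:
    """Equal-weight mix of task match and closeness in time (illustrative)."""
    same_task = 1.0 if TASK[prev.cls] == TASK[cand.cls] else 0.0
    gap = max(0.0, cand.start - prev.end)
    return 0.5 * same_task + 0.5 / (1.0 + gap)
```

With this toy score and a threshold of 0.6, "Opening a book" and "Closing a book" are grouped even when an unrelated action overlaps them in time.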

Other Data Details
An example of future action sequences of a selected intent is given in Figure 5. All of these conclusions pose a challenge not only for the generalization of multimodal matching, but also for compositional generalization.
Algorithm 1: Extract Action Sequences

Deep Generative Models Details
We mainly adapt the multimodal deep planning model ACT-UNIVL to tackle our task. The training set of ACT-UNIVL consists of 2,402 videos, each of which contains a video clip of the initial state s, an observed action sequence a_i, an intent g, and a future action sequence a_f. Both models are trained to minimize prediction errors of a_f.
ACT-UNIVL. ACT-UNIVL (Luo et al., 2020) is a SOTA unified pretrained vision-language model for multimodal understanding and generation. We consider ACT-UNIVL because it still performs best on video captioning tasks, such as YouCook2 (Zhou et al., 2017). YouCook2 contains task-oriented, instructional third-person videos about indoor cooking. The captions of a video are provided for the whole video without explicit alignments at the frame or segment level. In addition, ACT-UNIVL considers two sources of textual input: transcripts and captions. Hence, it is the closest to our target task. Taking as input a future-directed intent and a video clip of the initial state, ACT-UNIVL is fine-tuned to forecast future action sequences.
More specifically, we utilize ACT-UNIVL to map a video clip to a sequence of action names. Most of the action names are multi-word expressions. During training, ACT-UNIVL takes as input both the visual features of a video clip s and an observed action sequence a_i, and optimizes the model with multiple pre-training objectives. The visual features are extracted by the I3D model (Carreira and Zisserman, 2017) trained on Charades. During prediction, the model generates a future action sequence by taking only an initial visual state and a high-level intent as input. To fine-tune ACT-UNIVL, we set the max. frame, mean frame, and feature frame rate of the encoded features to 629, 113, and 3, respectively. We fine-tune ACT-UNIVL on two NVIDIA V100 GPUs for 50 epochs and select the best checkpoint based on the BLEU-3 metric.
Two Stage Planning Model. The two-stage planning baseline, TwoStagePlan for short, starts by converting the initial visual state s into a textual description of the observed action sequence, followed by applying a Seq2Seq language model, ACT-PROPHETNET (Qi et al., 2020a), to predict future actions.
At Stage 1, we adopt ACT-UNIVL on the video captioning task. Different from the single ACT-UNIVL baseline, we only train it with observed video clip inputs and let it generate the corresponding captions for observed action sequences. All other settings, including the training settings, remain the same as for the single ACT-UNIVL baseline.
In Stage 2, given an observed action sequence recognized by ACT-UNIVL, we fine-tune ACT-PROPHETNET following Jansen (2020). We prefer ACT-PROPHETNET over GPT2 (Radford et al., 2019) because it can learn to predict n future tokens jointly, which is computationally efficient and mitigates overfitting on strong local correlations. For each video, we take as input the intent and the observed action sequence, separated by a special token SEP, and train the model to minimize prediction errors of future action sequences. We fine-tune the model from the pretrained PROPHETNET-EN checkpoint for 50 epochs on two NVIDIA Tesla V100 GPUs and select the best model based on the validation loss.
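As a small illustration of the Stage 2 input format described above, the following sketch builds a source/target text pair for one video. The literal separator string "[SEP]", the comma used to join action names, and the function names are assumptions made for this example; the exact serialization is not specified here.

```python
from typing import List

SEP = "[SEP]"  # assumed surface form of the special separator token


def build_source(intent: str, observed_actions: List[str], sep: str = SEP) -> str:
    """Concatenate the intent and the observed action sequence around SEP."""
    observed = ", ".join(observed_actions)  # joining with commas is an assumption
    return f"{intent} {sep} {observed}"


def build_target(future_actions: List[str]) -> str:
    """Serialize the future action sequence the model is trained to predict."""
    return ", ".join(future_actions)
```

The text-only ablation would correspond to passing only the intent as the source, without the separator and the observed actions.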
ACT-PROPHETNET. To study the impact of visual information, we consider a text-only baseline by employing ACT-PROPHETNET. Here, ACT-PROPHETNET takes an intent as input and generates the future action sequences. Training follows the same procedure as Stage 2 of TwoStagePlan. This model serves as an ablation, in contrast to TwoStagePlan, which additionally uses recognized action sequences as input.

Neurosymbolic Planning Model
Instead of using the data in the training set to directly optimize model parameters, the neurosymbolic planning model (NSPlan) builds an MKB from the training data. Given a question in the test set, the model retrieves relevant knowledge based on the initial visual state and the intent, and then applies the retrieved knowledge to infer the most plausible answers from all available choices.

Retrieval from Multimodal Knowledge Base
The neurosymbolic planning model retrieves relevant action sequences from the MKB in two stages. The first stage aims to obtain all relevant action sequences in a computationally efficient manner. The second stage re-ranks the initial retrieval results by using both visual and symbolic knowledge.
First Stage. Given the initial state of a video, we apply the pretrained ACT-UNIVL model used in the two-stage planning model to predict a sequence of observed actions. This action sequence, in text form, is then sent as a query to retrieve the top-50 relevant action sequences from the MKB. For each retrieved result, the ranking score is the weighted sum of the BM25 (Robertson and Walker, 1994) score between the two action sequences and the cosine similarity between the intents. At this stage, only textual information is taken into account, and the temporal order of actions in a sequence is not considered because BM25 treats each action sequence as a bag of words.
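The first-stage score can be sketched as below. This is a self-contained illustration rather than the authors' implementation: it uses a common Okapi BM25 variant with default-style parameters `k1` and `b`, assumes intents are already embedded as vectors (the encoder is not specified here), and the mixing weight `alpha` and the knowledge-base record layout are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def bm25_scores(query: Sequence[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of every document (bag of words) for the query."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    df = Counter()
    for d in docs:
        df.update(set(d))  # document frequency of each term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in set(query):  # each distinct query term counted once
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * c for a, c in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(c * c for c in v))
    return dot / (nu * nv) if nu and nv else 0.0


def tokenize(actions: Sequence[str]) -> List[str]:
    """Flatten action names into lowercase word tokens (order is ignored by BM25)."""
    return [w for a in actions for w in a.lower().split()]


def first_stage(query_actions: Sequence[str], query_intent: Sequence[float],
                kb: List[Dict], alpha: float = 0.5,
                top_n: int = 50) -> List[Tuple[int, float]]:
    """Rank KB entries by alpha * BM25 + (1 - alpha) * intent cosine similarity."""
    docs = [tokenize(e["actions"]) for e in kb]
    bm25 = bm25_scores(tokenize(query_actions), docs)
    scored = [(i, alpha * bm25[i] + (1 - alpha) * cosine(query_intent, e["intent_vec"]))
              for i, e in enumerate(kb)]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_n]
```

In practice the two terms live on different scales, so the weight (or a normalization of the BM25 score) would need tuning on validation data.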
Second Stage. We re-rank the results from the first stage by taking the temporal order and the visual features of action sequences into account. Each action sequence is represented as a sequence of frame-level visual feature vectors, which are extracted by the same visual encoder, I3D. We apply the Ordered Temporal Alignment Module (OTAM) (Cao et al., 2020) to compare two visual feature sequences. OTAM computes a distance between a pair of sequences by integrating video segment distances only along the ordered temporal alignment path. We turn a distance into an alignment score by s_align = 1 / (1 + d_otam), where d_otam denotes the OTAM distance. Many retrieved action sequences do not contain future actions. In order to rank the sequences with potential future actions higher, we add a rule that favors long sequences containing unseen actions. The rule score s_rule = s_last + s_len is the sum of two binary indicator functions s_last and s_len, where s_last = 1 if and only if the last action of the retrieved result is not contained in the query set, and s_len = 1 if and only if the length of the retrieved result is greater than that of the query. The final ranking score s_f(r) of a result r is the weighted sum of the initial ranking score, the alignment score s_align, and the rule-driven score s_rule. To reduce noise, we keep only the top-10 results for probabilistic reasoning.
We provide a complete version of the comparison among all baselines on future sequence evaluation in Table 7.
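The second-stage scoring can be sketched as follows. The OTAM distance itself is not implemented here and is treated as a precomputed input; the weights `w_init`, `w_align`, and `w_rule`, the record layout, and the function names are assumptions for illustration, since the actual weights are not given in this section.

```python
from typing import Dict, List, Sequence


def alignment_score(d_otam: float) -> float:
    """s_align = 1 / (1 + d_otam): a smaller distance gives a higher score."""
    return 1.0 / (1.0 + d_otam)


def rule_score(retrieved: Sequence[str], query: Sequence[str]) -> int:
    """s_rule = s_last + s_len, each a binary indicator."""
    s_last = 1 if retrieved and retrieved[-1] not in set(query) else 0
    s_len = 1 if len(retrieved) > len(query) else 0
    return s_last + s_len


def final_score(s_init: float, d_otam: float,
                retrieved: Sequence[str], query: Sequence[str],
                w_init: float = 1.0, w_align: float = 1.0,
                w_rule: float = 1.0) -> float:
    """Weighted sum of the first-stage, alignment, and rule scores."""
    return (w_init * s_init
            + w_align * alignment_score(d_otam)
            + w_rule * rule_score(retrieved, query))


def rerank(results: List[Dict], query: Sequence[str], top_n: int = 10) -> List[Dict]:
    """Sort first-stage results by the final score and keep the top_n.

    Each result is assumed to be a dict with keys 'actions' (list of action
    names), 's_init' (first-stage score), and 'd_otam' (precomputed distance).
    """
    ranked = sorted(
        results,
        key=lambda r: final_score(r["s_init"], r["d_otam"], r["actions"], query),
        reverse=True,
    )
    return ranked[:top_n]
```

For example, with unit weights a retrieved sequence that extends the query with a new final action gains a rule score of 2, which can outweigh a somewhat lower first-stage score.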

Metrics
• Seq-item-acc: Sequence item classification accuracy evaluates the exact action matching of the predicted action sequence with the ground truth, counting how many times an action in the predicted sequence matches the ground truth at the exact position. For top-10 sequences, we calculate the mean accuracy over all sequences.
• Precision and recall: Precision and recall do not consider the order of the ground truth. They both treat the actions inside a sequence as a set. The precision of the top-10 sequences is computed by averaging the precision of each sequence, which is the number of true actions over the total number of actions in the sequence. Here, we define a true action as an action that occurs in the ground truth. Similarly, the recall of the top-10 sequences is computed by averaging the recall of all sequences, which is the number of true actions over the number of ground-truth actions.
• Seq-hit@k Rate: The seq-hits scores measure exact sequence matches, calculated as the number of examples whose top-k sequences include the ground-truth sequence. We report seq-hits@5 and seq-hits@10 accordingly. As for the retrieval-based baseline, we

Figure 1: In daily life scenarios, an agent should be aware of future actions that will likely be taken by the user based on what it has observed. In this example, inputs of intent and observation are colored in green, while potential future action sequences are highlighted in orange. The first two sequences contain actions which do not align with the human intent. Thus, the agent needs to automatically detect which future actions are plausible by understanding the user's intent.

Figure 2: An example action sequence in the MKB.

Figure 3: Two examples of the ViLPAct MQA task.

Generation of Incorrect Answers. We adapt the Adversarial Matching (AM) algorithm (Zellers et al., 2019) to turn the action sequence generation task into a multi-choice test. The key idea is to replace an action of an observed action sequence with an alternative action that is relevant to the preceding actions but not overly similar to the action being replaced. As many videos in the test set have only a single future action, the AM algorithm is extended to optionally insert a future action to generate an answer candidate. More specifically, given the initial state, the action sequence, and the intent (s, a, g) of a video, where a = (a_i, a_f), the algorithm starts by randomly deciding whether to apply substitution or insertion to generate an answer candidate. If insertion is chosen, it inserts an action randomly selected among the 157 candidate actions at a position that is randomly picked after the last action in a_i. If instead substitution is chosen, we feed the initial action sequence a_i to BERT and use the representation of the CLS token as the representation of a_i. Then we apply BERT to turn each action into a vector by using the corresponding CLS representation. We randomly pick a future action a_i in a_f and compute the score of a candidate action a_j as

Figure 5: An example of the future action sequence frequency distribution of the intent "S/He wants to satisfy my hunger".There are 30 distinct future action sequences matching this intent.

Figure 6: The neurosymbolic planning model is a multimodal retrieval & re-rank pipeline.

Table 1: Statistics of the MKB / training + validation set

Table 2: Basic statistics of MQA task / test set

candidates, we manually checked the grammaticality and fixed all the errors.

Table 3: Comparison of all systems. Human performance is 94.25% accuracy, obtained by asking humans to answer the MQAs directly.

Table 4: Top-10 Reasoner-scoring Accuracy on seen and unseen action sequences. Seen data refers to the MQAs with plausible action sequences observed in the training data. Unseen data refers to the ones with plausible action sequences not observed in the training data.

Table 6: Comparison of top-10 future sequences

Input: Actions = {a1, a2, ..., an}, where each action ai = ⟨cls_ai, ts_ai, te_ai⟩; cls_ai is the action class, and ts_ai and te_ai are the start time and end time of action ai. Relevance threshold
Output: Activities = {A1, A2, ..., An}, where each activity represents an action sequence
Remaining actions set Ra = Actions
while Ra ≠ ∅ do
    Sort Ra in ascending order by start time ts
    pre action a = Ra[0]
    Activity A = {a}
    Search = True
    while Search do
        candidates Ca = {aj ∈ Ra|t