Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals

The ability to sequence unordered events is evidence of comprehension and reasoning about real-world tasks/procedures, and is essential for applications such as task planning and multi-source instruction summarization. It often requires thorough understanding of temporal common sense and multimodal information, since these procedures are often conveyed by a combination of texts and images. While humans are capable of reasoning about and sequencing unordered procedural instructions, the extent to which current machine learning methods possess such capability is still an open question. In this work, we benchmark models' capability of reasoning over and sequencing unordered multimodal instructions by curating datasets from online instructional manuals and collecting comprehensive human annotations. We find that current state-of-the-art models not only perform significantly worse than humans but also seem incapable of efficiently utilizing multimodal information. To improve machines' performance on multimodal event sequencing, we propose sequence-aware pretraining techniques that exploit the sequential alignment properties of both texts and images, resulting in >5% improvements on the perfect match ratio.


Introduction
Instructions are essential sources for agents to learn how to complete complex tasks composed of multiple steps (e.g., "making a wood sign from scratch"). However, instructions do not always come in a proper sequential order, for example, when instructions must be combined across sources (e.g., to accomplish a complex task, multiple useful resources for certain task-steps may come out of a single Google search). Therefore, sequencing unordered task-steps is crucial for comprehending and inferring task procedures, which requires thorough understanding of event causal and temporal common sense. It is essential for applications such as multi-source instruction summarization and robot task planning (Garattoni and Birattari, 2018).

[Figure 1: An example task, "How To Make A Wood Sign". Text-only steps ("Sand the wood block. Press a sandpaper on and rub it."; "Apply a wood primer to the surface."; "Apply the paint to the wood surface if you want bold, solid colors."; "Use long strokes to apply a wood stain for a more rustic look.") carry potential ambiguity that the accompanying images disambiguate.]
Existing work has studied sequencing unordered texts from paper abstracts or short stories (Chen et al., 2016; Cui et al., 2018). However, real-life tasks are often complex, and multimodal information is usually provided to supplement textual descriptions to avoid ambiguity or illustrate details that are hard to narrate, as illustrated in Figure 1.
To investigate whether current AI techniques can efficiently leverage multimodal information to sequence unordered task instructions, we curate two datasets from online instructional manuals (Hadley et al.; Yagcioglu et al., 2018). We consider two representative instruction domains: cooking recipes and "How-To" instructions (WikiHow). We establish human performance for the sequencing task on a subset of each data resource. As certain steps to perform a task can potentially be interchangeable, we collect annotations of possible orders alternative to the originally authored ones to create multiple references. Such additional annotation not only provides better measurement of human and model performance by alleviating unintended biases from content creators, but also serves as a useful resource for future research on models that are aware of task-step dependencies and interchangeability.
To measure the ability of state-of-the-art AI techniques to sequence instruction steps, we construct models consisting of: (1) an input encoder which encodes image, text, or multimodal inputs, and (2) an order decoder which predicts the step order using the encoded representations. They are jointly trained with the order supervision.
Our preliminary studies show that multimodal information is consistently helpful for the sequencing task. However, compared to humans, current models are less efficient in utilizing multimodal information. We hypothesize that this is because the models effectively capture neither the sequential information in the vision modality nor the sequential alignment between multimodal contents. To address this, we propose to equip models with the capability of performing sequence-aware multimodal grounding. Specifically, we propose several self-supervised objectives, including sequence-based masked language modeling, image region modeling, and content-swapped prediction, to pretrain the models before finetuning them on the downstream sequencing task.
The proposed pretraining techniques are shown to be effective in improving multimodal performance, enjoying a >5% improvement on the perfect match ratio metric. However, it is still significantly behind human performance (∼15% in perfect match ratio). The same trend is observed when alternative orders are considered.
Our key contributions are two-fold: (1) We propose a multimodal sequencing task with two curated instructional manuals and comprehensive human annotations. (2) We investigate model performance on sequencing unordered manuals, and propose sequence-aware pretraining techniques to more effectively use the multimodal information. Our experiments and extensive analysis provide insights on which task categories are most challenging for the state-of-the-art models. They also show that more sophisticated sequential multimodal grounding is required to further improve performance on the proposed multimodal sequencing task.

Problem Definition
Given a task procedure S consisting of N steps, each step S_i ∈ S can consist of two types of contents: a textual description T_i of tokens {T_{i,k}}_{k=1}^{n_T} and/or image(s) I_i. A model is required to take as input a random permutation of S, i.e., S^p = {S_{p_1}, ..., S_{p_N}}, where p is a permutation (S_{p_j} can take one of the following three modalities: T_{p_j}, I_{p_j}, or {T_{p_j}, I_{p_j}}), and predict the correct order of S^p, i.e., argsort(S^p).
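To make the formulation concrete, the following is a minimal sketch of how a shuffled input and its argsort target relate (function and variable names are ours, not from the paper):

```python
import random

def make_sequencing_example(steps, seed=None):
    """Shuffle an ordered step sequence S into S^p and return it together
    with the ground-truth target argsort(S^p). Sorting the shuffled steps
    by the returned target recovers the original order S."""
    rng = random.Random(seed)
    perm = list(range(len(steps)))       # p: original positions of S^p items
    rng.shuffle(perm)
    shuffled = [steps[i] for i in perm]  # S^p
    # argsort(S^p): indices of shuffled items ordered by original position
    target = sorted(range(len(perm)), key=lambda j: perm[j])
    return shuffled, target

steps = ["sand the block", "apply primer", "apply paint"]
shuffled, target = make_sequencing_example(steps, seed=0)
assert [shuffled[j] for j in target] == steps
```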

Datasets and Human Annotation
We are interested in understanding the current state-of-the-art models' performance on this multimodal instruction sequencing task. To this end, we curate instruction datasets to support our study.

Instruction Manual Datasets
There are three major features we require of the target datasets: (1) It is multimodal. (2) It consists of task procedures as sequences of steps. (3) Different modalities are used intentionally to complement each other. In light of these, we consider the following two datasets:

RecipeQA. We start from a popular as well as intuitive choice of instruction manuals, recipes, which fully fulfill the aforementioned criteria. RecipeQA is a multimodal question answering dataset consisting of recipes scraped from Instructables.com (Yagcioglu et al., 2018). We utilize the recipes collected in RecipeQA and convert each unique recipe into sequential multimodal steps for our task.
WikiHow. To expand the types of instruction manuals for our task beyond recipes, we also consider a popular "How To ..." type of instructions, WikiHow, an online knowledge base that consists of human-created articles describing procedures to accomplish a desired task. Each article contains a high-level goal of a task, a short summary of the task procedure, and several multimodal steps, where each step consists of a description paired with one or a few corresponding images.
We scrape the entire WikiHow knowledge resource, containing more than 100k unique articles (mostly) with multimodal contents, as well as the hierarchically structured categories for each article.

Human Performance Benchmark
To ensure the validity of our proposed multimodal sequencing task, we establish the human performance via Amazon Mechanical Turk. Since our dataset is constructed from resources that are not directly designed for the sequencing task, the quality of random samples is unverified. Specifically, some articles in WikiHow may not have a notion of proper order among the steps. As a result, to construct a high-quality test set, particularly for WikiHow, for establishing human performance, we first identify a set of categories which are more likely to feature proper order, e.g., Home and Garden and Hobbies and Crafts. A random proportion is then sampled, and the co-authors further downsample the subset to 300 samples with the aforementioned criteria via majority vote. For RecipeQA, we randomly sample 100 recipes from the dataset. The resulting two subsets serve as our golden test set for performance benchmarking.
Human Performance. Prompted with a task goal and a randomly scrambled sequence of the task-steps (in one of the following modalities: multimodal or text/image-only), workers are asked to examine the contents and decide the proper performing order. Human performance is then computed against the originally authored orders as the ground truths, averaged across the whole set.

Alternative Orders. When performing a task, some steps can be interchangeable. To take this interchangeability into consideration in our benchmark, we also collect possible alternative orders to the original ones to create multiple references. For each instance in our golden test set, given the instruction steps sequenced in their original order, we ask workers to annotate alternative orders if the presented task-steps can be performed in a different order. Although in this work we mainly focus on sequential instructions, and hence interchangeability is also gauged in a sequential manner, we want to point out that the nature of task-step interchangeability is also highly related to parallel (branching) steps of tasks (Sakaguchi et al., 2021). We argue that actions that can be performed interchangeably imply no direct dependencies among them and thus can potentially be parallelized; hence, our alternative-order formulation can help infer these parallel actions.
More details of the two human annotation tasks can be found in Appendix Sec. B.

Models
To benchmark the proposed task, we construct models comprising: (1) an encoder which encodes multimodal or text/image-only inputs, and (2) an order decoder which utilizes the encoded representations to predict the orders. To help models better capture the sequentiality of task-steps, as well as adapt to our target task domains, we pretrain the encoders with several self-supervised objectives on the instructions before integrating them with the decoder.

Input Encoders
Text-Only Encoders. We use RoBERTa (Liu et al., 2019) for text-only inputs. Although the next-sentence prediction in BERT (Devlin et al., 2019) can potentially be exploited for sequencing, we empirically find that RoBERTa performs better.

[Figure 2: Overview of the sequence-aware pretraining: a Vision-Language (V&L) transformer takes a [CLS] token, step texts (e.g., "Apply wood primer ..."), and CLIP visual-encoder features as input.]
Multimodal Encoders. We consider the following two V&L models, mainly due to their easy adaptation to our proposed sequencing task: VisualBERT (Li et al., 2019) grounds detected image regions (e.g., by Faster-RCNN (Ren et al., 2016)) to language with a single transformer model (Vaswani et al., 2017). VisualBERT is pretrained with: (1) multimodal masked language modeling (MLM), and (2) image-text matching prediction (ITM), where the image in an image-caption pair is randomly replaced with another one to create misalignment, and the model is required to predict whether the current pair is aligned.
CLIP-ViL (Shen et al., 2021) is also a single-stream V&L model similar to VisualBERT, but the visual encoder is replaced by a patch-based model inspired by the ViT (Dosovitskiy et al., 2021) in CLIP (Radford et al., 2021), where the image features are taken as gridded image patches as shown in Figure 2. The pretraining objectives remain the same as VisualBERT's. Empirically, both Shen et al. (2021) and this work find such patch-based models tend to yield better downstream performance.
Image-Only Encoders. We attempt to provide an image-only baseline on our sequencing task with two visual encoders: (1) a ResNet-based (He et al., 2016) Faster-RCNN model (also the visual encoder in VisualBERT), where both the detected regional features and the whole-image feature are used, and (2) the aforementioned patch-based CLIP model. (RoBERTa is used to initialize VisualBERT and CLIP-ViL. Throughout the paper, we term the ViT- and CLIP-inspired visual encoder simply as CLIP.)

Sequence-Aware Pretraining
The standard multimodal grounding techniques (Li et al., 2019; Lu et al., 2019; Su et al., 2020; Chen et al., 2020a) do not explicitly concern the sequentiality of text and associated image sequences, and hence may fall short of effectively utilizing the sequential properties of multimodal inputs. To encourage models to have better awareness of the sequential alignments in multimodal instruction steps, we propose to pretrain the encoders with the following self-supervised objectives: (1) masked language modeling (MLM), (2) (patch-based) image-swapping prediction (ISP/PISP), and (3) sequential masked region modeling (SMRM). Figure 2 illustrates an overview of the pretraining paradigm.
For the proposed objectives, the inputs to the models are generally ordered instruction step sequences, which can be further sub-sampled to produce length-varying subsequences. Although we do not find this necessarily benefits the downstream performance, we observe that the sub-sampling helps the model converge faster. While all of our proposed objectives can be applied to sequences of arbitrary length (≥ 2), without loss of generality and for simplicity, the following sections assume the sub-sampled sequence is of length 2.

Masked Language Modeling
The standard MLM (Devlin et al., 2019) is employed by the text-only models to adapt a pretrained language model to the target domain (task instructions). Following prior V&L work, we apply MLM to multimodal models. Specifically, we ensure that the textual description of each step T_i gets a similar amount of tokens masked out, such that the models can potentially exploit the image sequences more.
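The per-step balancing described above can be sketched as follows; a minimal illustration assuming a fixed 15% mask rate per step (the function name and exact rounding are ours, not from the paper):

```python
import random

MASK, RATE = "[MASK]", 0.15

def stepwise_mask(step_token_lists, rng=random):
    """Mask roughly the same fraction of tokens in every step's text,
    so that no single step dominates the MLM signal and the model is
    nudged to also exploit the image sequence."""
    masked, labels = [], []
    for tokens in step_token_lists:
        k = max(1, round(RATE * len(tokens)))        # ~15% per step
        idx = set(rng.sample(range(len(tokens)), k))
        masked.append([MASK if i in idx else t for i, t in enumerate(tokens)])
        labels.append({i: tokens[i] for i in idx})   # reconstruction targets
    return masked, labels
```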

Swapping-Based Prediction
This objective concerns, with a certain probability, randomly swapping a pair of items in a sequence and asking the model to judge whether the resulting sequence is properly ordered (i.e., binary classification). We mainly perform the swapping in the image modality, and hence it can be viewed as a sequence-aware version of the ITM objective in most V&L models. As in ITM, the output representation at the [CLS] token is used to make the prediction.
Standard. For an ordered sequence S, we randomly swap two items of S, {S_i, S_j}, where i < j, to {S_j, S_i}, with a certain probability δ.
Our preliminary studies find that swapping the textual contents does not necessarily help the downstream performance for either text-only or multimodal models, so we only perform the swapping on the images {I_i, I_j} in both the multimodal and image-only models. For patch-based image inputs (or regional features), all the patches of an image are swapped with those of another image within the same sequence, as illustrated in Obj 2 in Figure 2.
Patch-Based. We can perform the aforementioned swapping prediction at a finer granularity, directly on the image patches. Assuming each image I_i is cropped into w patches (or w detected regions), i.e., {i_{i,k}}_{k=1}^{w} = {i_{i,1}, ..., i_{i,w}}, we randomly select M (ranging from 1 to w) patches from each of the two images I_i, I_j (i.e., {i_{i,p}}, {i_{j,q}}, with p, q in the M-sized sampled index sets) to be swapped with probability δ. Specifically, for each selected image patch i_{i,m} ∈ I_i, a randomly selected image patch i_{j,n} ∈ I_j is sampled to be swapped with it. The M-sized index sets do not need to be the same set of integers for each image. Obj 3 in Figure 2 illustrates the patch-based swapping prediction with w = 4 and M = 2.
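The two swapping variants can be sketched as follows; a minimal illustration with images represented as opaque items and patch lists (function names are ours, not from the paper):

```python
import random

def image_swap(images, delta=0.5, rng=random):
    """Standard ISP: with probability delta, swap two whole images in an
    ordered sequence. Returns the (possibly corrupted) sequence and a
    binary label: 1 if a swap was applied, else 0."""
    images = list(images)
    if len(images) >= 2 and rng.random() < delta:
        i, j = sorted(rng.sample(range(len(images)), 2))
        images[i], images[j] = images[j], images[i]
        return images, 1
    return images, 0

def patch_swap(patches_i, patches_j, m, rng=random):
    """Patch-based ISP (PISP): exchange m randomly chosen patches of image i
    with m randomly chosen patches of image j; the two index sets need not
    be the same set of integers."""
    patches_i, patches_j = list(patches_i), list(patches_j)
    p_idx = rng.sample(range(len(patches_i)), m)
    q_idx = rng.sample(range(len(patches_j)), m)
    for p, q in zip(p_idx, q_idx):
        patches_i[p], patches_j[q] = patches_j[q], patches_i[p]
    return patches_i, patches_j
```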

Sequential Masked Region Modeling
Prior works extend masked learning to the visual modality, where the masked target is either a predefined discrete visual vocabulary (Sun et al., 2019; Bao et al., 2021) or (soft) object class labels (Lu et al., 2019; Su et al., 2020; Chen et al., 2020a). In this work, we construct a feature-based target vocabulary dynamically in each training batch. We first randomly select the same amount, X% (X = 15), of patches for each image to be masked out (replaced with a 0-tensor), and then construct a target vocabulary from the original output representations (before masking) of these patches.
Concretely, denote the output representation of an input image patch i_{i,m} as h(i)_{i,m} and the masked positions of I_i as D_i. We can construct a candidate list from all the output representations of the patches at the masked positions of each image, i.e., C = {h(i)_{i,m}} ∪ {h(i)_{j,n}}, m ∈ D_i, n ∈ D_j. Denote the masked image patches (the gray-colored patches in Figure 2) as mask(i)_{i,m}; for each output masked representation h(mask(i))_{i,m}, we concatenate it with each candidate, i.e., h(mask(i))_{i,m} || h(i'), ∀i' ∈ C, which results in |C| concatenated representations for each masked position. A |C|-way multi-class classification can then be performed by maximizing the probability p(i_{i,m} | h(mask(i))_{i,m}; C). For robust training, we additionally: (1) shuffle the candidate set C for each masked position to prevent overfitting, and (2) ensure the overlap of masked positions in each pair of images, D_i ∩ D_j, is < 50%, allowing the models to utilize information from similar regions of other images in the sequence.
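The in-batch candidate construction can be sketched as follows. This is a simplified illustration: a dot-product score stands in for the paper's concatenation-plus-classifier head, and all names are ours.

```python
def dot(u, v):
    """Inner product of two equal-length vectors (plain lists)."""
    return sum(a * b for a, b in zip(u, v))

def smrm_targets(h_orig_i, h_orig_j, d_i, d_j):
    """Build the candidate set C from the *unmasked* output representations
    at the masked positions D_i, D_j of both images; the label of the k-th
    masked patch is its own index k in C."""
    cand = [h_orig_i[m] for m in d_i] + [h_orig_j[n] for n in d_j]
    targets = list(range(len(cand)))
    return cand, targets

def smrm_scores(h_masked, cand):
    """|C| classification scores for one masked-position representation."""
    return [dot(h_masked, c) for c in cand]
```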

Overall Training Objective
As the mechanisms in some objectives cannot guarantee mutually exclusive impacts (e.g., performing ISP and PISP simultaneously may create confusing swapped patches), we employ a turn-taking fashion: with uniform probability, one of the objectives (Obj) is sampled for each training mini-batch. The overall pretraining loss for a mini-batch is thus the loss of the sampled objective, L = L_Obj, Obj ∼ Uniform({MLM, ISP, PISP, SMRM}).
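The turn-taking scheme can be sketched as below; a minimal illustration where each objective is represented by a loss callable (names are ours, not from the paper):

```python
import random

def pretrain_step(batch, losses, rng=random):
    """Turn-taking objective: sample exactly one pretraining objective per
    mini-batch with uniform probability, so mutually conflicting corruptions
    (e.g. ISP and PISP on the same images) are never applied together."""
    obj = rng.choice(sorted(losses))  # uniform over objective names
    return obj, losses[obj](batch)
```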

Order Decoder -BERSON
BERSON is a recently proposed state-of-the-art neural sentence ordering framework (Cui et al., 2020), in which a pointer network (Vinyals et al., 2016) exploits both the local (relative pairwise order) and the global (self-attention over the entire input sequence) information of the inputs to decode the predicted order. BERSON mainly exploits the [CLS] output representations for relational understanding, which aligns well with how our encoders are pretrained (Figure 2). We integrate our encoders (with or without sequence-aware pretraining) into BERSON, replacing its original BERT encoder. The BERSON-module-specific components are freshly initialized, and the entire integrated module is then finetuned on our sequencing task.
Experiments

Our experiments seek to answer these questions: (1) How valid is the proposed task for humans to complete? (2) Is multimodality helpful? (3) Can the proposed sequence-aware pretraining utilize multimodality more effectively? (4) How would results differ when alternative orders are considered?

Evaluation Metrics
We adopt metrics from sentence ordering works: Position-based metrics concern the correctness of the absolute position of each item in a sequence, including: (1) Accuracy (Acc), which computes the ratio of absolute positions in the ground-truth order that are correctly predicted; (2) Perfect Match Ratio (PMR), which measures the percentage of predicted orders exactly matching the ground-truth orders; and (3) Distance (Dist.), which measures the average distance between the predicted and ground-truth positions of each item (for all metrics except distance, higher is better).
Longest Common Subsequence computes the average longest common subsequence (Gong et al., 2016) between the predicted and ground-truth orders (L_q). We also consider a stricter version, the longest common substring, which requires consecutiveness in the comparison (L_r).
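The two variants can be sketched as standard dynamic programs over the predicted and ground-truth orders (function names are ours):

```python
def longest_common_subsequence(pred, gold):
    """L_q: longest (not necessarily consecutive) common subsequence."""
    dp = [[0] * (len(gold) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred):
        for j, g in enumerate(gold):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if p == g
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def longest_common_substring(pred, gold):
    """L_r: longest run of items consecutive in BOTH orders (stricter)."""
    best, dp = 0, [0] * (len(gold) + 1)
    for p in pred:
        new = [0] * (len(gold) + 1)
        for j, g in enumerate(gold):
            if p == g:
                new[j + 1] = dp[j] + 1
                best = max(best, new[j + 1])
        dp = new
    return best
```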
Kendall's Tau (τ) (Lapata, 2003) is defined as 1 − 2 × (# inversions)/(# pairs), where an inversion denotes that the predicted relative order of a pair of items is inverted compared to the corresponding ground-truth relative order, and # pairs = N(N−1)/2 for an N-length sequence. Each metric focuses on a different perspective of the predictions: position metrics concern absolute correctness, while the common subsequence and τ metrics measure whether the general sequential tendency is preserved despite incorrect absolute positions.
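The position-based metrics and τ can be sketched as follows, with both orders given as sequences of item ids (function names are ours):

```python
from itertools import combinations

def position_metrics(pred, gold):
    """Acc (fraction of correct absolute positions), PMR (exact match),
    and average positional distance between predicted and gold orders."""
    n = len(gold)
    acc = sum(p == g for p, g in zip(pred, gold)) / n
    pmr = 1.0 if pred == gold else 0.0
    pos_p = {v: i for i, v in enumerate(pred)}
    pos_g = {v: i for i, v in enumerate(gold)}
    dist = sum(abs(pos_p[v] - pos_g[v]) for v in gold) / n
    return acc, pmr, dist

def kendall_tau(pred, gold):
    """tau = 1 - 2 * (# inverted pairs) / (# pairs)."""
    pos_p = {v: i for i, v in enumerate(pred)}
    pairs = list(combinations(gold, 2))          # (a, b): a precedes b in gold
    inv = sum(pos_p[a] > pos_p[b] for a, b in pairs)
    return 1 - 2 * inv / len(pairs)
```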

Implementation Details
We use the original data splits for RecipeQA. For WikiHow, to prevent models from exploiting knowledge from similar articles, we split the data so that certain (sub)categories do not overlap across splits. We use only the train split of each dataset to perform its respective pretraining. More details of the data splits are in Appendix Sec. A. Preliminary studies show that joint training with both RecipeQA and WikiHow data does not necessarily improve the downstream performance, thus the models evaluated on the two datasets are trained simply using their respective training sets for faster convergence.
We cap the overall sequence length at 5 and each step description at a maximum of 5 sentences, for both models and humans. The maximum input length per step is 60 tokens (overall maximum length = 300) for training and GPU memory efficiency. δ = 0.5 for both ISP and PISP. All images are resized to 224 × 224, and 32 × 32 patches are used for CLIP-based models, resulting in 7 × 7 = 49 patches per image. Aside from the standard positional embedding, we only supplement a modality token-type embedding (text := 0, image := 1) for the multimodal models. Pretrained weights for each encoder are obtained either from their corresponding code bases or by running their code on our setup.

Standard Benchmark Results
Table 2 summarizes both the human and model performance for each input modality, evaluated using the original ground-truth orders on the golden test set, whereas Table 3 provides a more detailed breakdown of model performance when incrementally combining the pretraining objectives.
As shown, multimodal information is verified to be consistently helpful for humans. Compared under the same scenario, with or without the sequence-aware pretraining, the two multimodal models consistently outperform their text-only counterparts, and the proposed pretraining technique is shown to be particularly effective for the patch-based multimodal model (CLIP-ViL). However, our top-performing models still exhibit significant gaps below human performance, especially in PMR.
Additionally, we observe a different trend in the two datasets: multimodality benefits RecipeQA more than WikiHow. The gap between multimodal human and model performance is larger than for the text-only counterparts in WikiHow, while a reversed trend is shown in RecipeQA. We hypothesize that recipes may contain more domain-specific language usages and/or words less familiar to the pretrained language models, and hence benefit more from our in-domain sequence-aware pretraining. Humans, on the other hand, benefit more from the images in WikiHow, as its texts are hypothesized to contain more ambiguities.
WikiHow Category Analysis. We are interested in knowing on which categories of WikiHow our models perform closest to humans, and on which the multimodal information is most efficiently utilized.
In Figure 3 we select the categories with the largest and smallest performance gaps (on the PMR metric; top = 3, least = 2) between humans and our best-performing models. We observe that the categories on which the multimodal models outperform the text-only ones the most are also the categories where the models perform closest to humans, e.g., Home and Garden.
We hypothesize that the images in these categories complement the texts well and that our sequence-aware grounding performs effectively. In contrast, in categories such as Arts and Entertainment and Hobbies and Crafts, where humans still benefit from multimodal information, our models have difficulty utilizing it. We hypothesize that better visual understanding may alleviate the potentially suboptimal grounding, as images in these categories can contain many uncommon objects.

Evaluating with Alternative Orders
For each instance where alternative ground-truth orders exist, the performance is computed as the best score each predicted order can obtain against all the ground-truth orders, denoted multi-reference performance; the subset containing these instances is denoted the multi-reference subset.

Statistics. Table 5 lists the essential statistics of the multi-reference subsets, including the counts of multi-reference instances for each dataset and modality, as well as the per-instance statistics.
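The best-over-references scoring described above can be sketched as a simple max over the annotated orders (names are ours; any of the metrics from the previous section can be plugged in):

```python
def multi_reference_score(pred, references, metric):
    """Score a predicted order against every annotated ground-truth order
    (the original plus all alternatives) and keep the best value."""
    return max(metric(pred, ref) for ref in references)

# example with a simple positional-accuracy metric
acc = lambda p, g: sum(a == b for a, b in zip(p, g)) / len(g)
score = multi_reference_score([0, 2, 1], [[0, 1, 2], [0, 2, 1]], acc)
```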
Multi-Reference Performance. Table 4 compares the best-performing variants of the main competitors on the selected metrics. Several trends still hold: (1) Multimodal models still outperform their text-only counterparts.
(2) Human performance is still well above the models' even under the multi-reference setup. Additionally, both humans and models perform significantly worse on the multi-reference subset when the single (original) ground truth is enforced, implying the validity of our alternative-order annotations.
We originally hypothesized that enforcing the originally authored order as the only ground truth would be unfair to the text-only models, as images can often better represent the detailed scene changes omitted by the texts, while in reality certain steps may not need to strictly follow the authored order. Judging from the number of instances that improve after evaluating with alternative orders, the text-only model indeed benefits more from the multi-reference setup. Examining the general trends in Table 4, one can conclude that the textual contents indeed possess certain levels of ambiguity that images can help alleviate. However, as the performance gaps between multimodal and text-only models are still significant under the multi-reference settings, the advantages of multimodality remain. Note that humans achieve perfect performance on the multi-reference subset of RecipeQA; though it may seem unlikely, this is mainly because recipes tend to have fewer possible alternative orders.
WikiHow Categories. Table 6 lists the WikiHow categories with the most (top-5) annotated multi-reference ground truths (we only list the categories with a total instance count > 10). Note that the categories with more annotated alternative ground truths are also among those with the worst performance from both humans and models (Figure 3). We provide sample qualitative inspections in Appendix Sec. C.1.

Related Work
Sequence Ordering. The story sequencing test is a popular way of examining children's abilities in sequential reasoning, which is shown to be evidence of procedural understanding (Tomkins, 1952; Baron-Cohen et al., 1986; Loucks et al., 2017). In NLP, existing works attempt the sequencing task as sorting a series of unordered sentences (Chen et al., 2016; Cui et al., 2018; Logeswaran et al., 2018; Oh et al., 2019; Lee et al., 2020; Calizzano et al., 2021) from paper abstracts or short paragraphs.
While certain prior work also attempts to incorporate multimodality (Agrawal et al., 2016), the dataset used, Visual StoryTelling (Huang et al., 2016), features album images that were not intended to be procedural nor to supply unstated details complementing the texts. In computer vision, existing work leverages shuffled-frame prediction for learning video representations (Lee et al., 2017; Xu et al., 2019; Wang et al., 2020; Li et al., 2020), as well as cycle-consistency constraints for learning temporal dynamics (Epstein et al., 2021). Zellers et al. (2021) also feature a pairwise relative frame re-ordering objective to learn temporal common sense from scripted videos; however, as their downstream tasks mainly concern visual reasoning and ordering by frame-text matching (also on Visual StoryTelling), the re-ordering objective is more focused on the visual modality. Our work takes a different perspective, tackling a comprehensive multimodal sequencing task with a focus on procedural task-solving knowledge and gauging the helpfulness of complementary information in different modalities.
Task/Procedure Understanding.Other works have utilized WikiHow for learning task knowledge.
In NLP, textual descriptions from WikiHow have been used for abstractive summarization (Koupaee and Wang, 2018), procedural understanding (Zhou et al., 2019; Tandon et al., 2020), and intent estimation (Zhang et al., 2020a). Prior work (Zhang et al., 2020b) considers WikiHow for learning event temporal ordering, but is limited to pairwise relations. A concurrent work uses WikiHow to infer visual goals (Yang et al., 2021). We hope our curation can help advance the goal of comprehensive multimodal procedural understanding.
Another popular form of comprehending given procedures is through a multiple-choice machine comprehension task. Prior work has utilized textbook figures (Kembhavi et al., 2017) as a holistic "reading reference" for models to select the correct order of certain (textually described) events from multiple choices. Another work attempts the original visual ordering task of RecipeQA (Liu et al., 2020) (also a multiple-choice task). However, we argue that our task is more complex, as the desired orders need to be directly derived, and event-wise complementary multimodal understanding is not an essential component of these existing works.
Multimodality. Besides the models used in this work, there are several recent advanced multimodal grounding techniques (Tan and Bansal, 2019; Li et al., 2019; Lu et al., 2019; Su et al., 2020; Chen et al., 2020b; Huang et al., 2020; Wen et al., 2021). We utilize VisualBERT and CLIP-ViL for their simplicity of adaptation to our task and easier integration with our proposed pretraining techniques; however, our framework is able to incorporate any of the aforementioned multimodal models.

Conclusions
In this work we present studies of language and multimodal models on procedure sequencing, leveraging popular online instructional manuals. Our experiments show that both multimodality and our proposed sequence-aware pretraining are helpful for multimodal sequencing; however, the results also highlight significant gaps below human performance (∼15% on PMR).
We provide insights as well as resources, such as the multi-reference annotations of the sequencing task, to spur future relevant research. We also anticipate that the alternative orders defined and annotated in our work can benefit more comprehensive task-procedure understanding. Future work, such as predicting task steps that can be parallel or interchangeable, and understanding step dependencies, can be explored.
Ethical Considerations

We hereby acknowledge that all of the co-authors of this work are aware of the provided ACM Code of Ethics and honor the code of conduct. This work is mainly about sequencing a given series of multimodal task procedures, represented by text descriptions along with their images. The following outlines both our ethical considerations and our potential impacts on the community.
Dataset. We collect the human performance on our sequencing task (both the standard human performance and the alternative-order annotations) via Amazon Mechanical Turk (MTurk) and ensure that all personal information of the workers involved (e.g., usernames, emails, URLs, demographic information, etc.) is discarded in our dataset. While the sequence orders, either the original author-intended ones or those annotated by the workers for the standard performance, may possess unintended biases against certain groups of people (e.g., due to cultural or educational differences, some tasks may be performed differently from the originally intended orders), we anticipate the additional multi-reference annotation can alleviate such an issue as well as provide a broader view to approach procedural understanding, i.e., certain task-steps can be interchanged.
This research has been reviewed by the IRB board and granted IRB-exempt status. The detailed annotation process (pay per amount of work, guidelines) is included in the appendix; overall, we ensure our pay per task is above the annotators' local minimum wage (approximately $12 USD / hour). We primarily consider English-speaking regions for our annotations as the task requires a certain level of English proficiency.
Techniques. We benchmark the proposed sequencing task with state-of-the-art large-scale pretrained language and multimodal models along with our novel sequence-aware pretraining techniques. As commonsense and task-procedure understanding are our main focus, we do not anticipate production of harmful outputs, especially towards vulnerable populations, after training models on our proposed task.

A.1 Image Contents
For simplicity and computational concerns, in this work we only pair one image with each of its associated task-step textual descriptions. However, in both WikiHow and RecipeQA, each task-step can have more than one associated image, or visual contents represented by short clips or GIFs. We simply select the first image, which is supposed to be the most representative, for those steps featuring multiple images, and sample the frame in the middle of the time interval for clips or GIFs. Nevertheless, our framework does not assume any limitation on how many images per step can be processed.
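The selection heuristics above can be sketched as follows (a minimal illustration; the function names are ours and not from the released code):

```python
def select_step_image(images):
    """Pick the first image of a step as the representative one,
    assuming the author-uploaded order (first = most representative)."""
    return images[0] if images else None

def select_clip_frame(frames):
    """For a clip/GIF, sample the frame at the middle of its time
    interval, approximated here by the middle index of the frame list."""
    return frames[len(frames) // 2] if frames else None
```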

A.2 WikiHow Categories
The categories in WikiHow generally form a hierarchical directed acyclic graph. Each category can have relevant subcategories, which usually span finer-granularity category types. For example, a possible category traversal path is: Cars and Vehicles → Public Transport → Air Travel, which can lead to the article How to Overcome the Fear of Flying. We attach these full category traversal paths as an additional feature to each of the articles in our dataset, and we will also provide a complete list of the taxonomy composed of all the categories and subcategories in WikiHow. We include the category-data counts in Table 7 for reference, where we only show the top-level categories; the more in-depth categories can be referred to in the full released version of the dataset.

A.3 Train-Dev Splits
For RecipeQA we use the original data splits which ensure no identical recipe appears in more than one set (each recipe has its unique recipe-id), as this dataset only has one category and the data quality is much more uniform than that of WikiHow, i.e. most recipes fulfill our target dataset criteria.
For WikiHow, we split the data according to the third-level category to prevent models from exploiting overly similar task knowledge in the same category, where the level (three) is empirically decided. Specifically, we ensure that the third-level categories to which the articles in our golden-test-set belong do not appear in the train set. We first split the WikiHow dataset into train, development, and test sets following this strategy, and then construct our golden-test-set by sub-sampling a subset from the test set.
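The category-disjoint split can be sketched as follows (an illustrative sketch, assuming each article carries its category traversal path in a "categories" field; the field and function names are ours, not from the released code):

```python
def third_level(article):
    """Return the third-level category of an article's traversal path,
    e.g. ["Cars and Vehicles", "Public Transport", "Air Travel"];
    falls back to the deepest available level for shorter paths."""
    path = article["categories"]
    return path[2] if len(path) >= 3 else path[-1]

def split_disjoint(articles, test_categories):
    """Assign every article whose third-level category is held out for
    testing to the test split, keeping the splits category-disjoint."""
    train, test = [], []
    for article in articles:
        (test if third_level(article) in test_categories else train).append(article)
    return train, test
```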

B Details of Human Annotation
B.1 Golden-Test-Set Selection
In order to construct a high-quality test set for humans to evaluate, we manually select the samples which meet our general criteria: (1) the tasks are procedural in both texts and images; (2) the task's images are designed to complement the textual descriptions or provide more illustrative information for some unstated implicit knowledge. We ask three of our internal members (co-authors) to perform this manual selection, and preserve the samples that receive majority votes. In total, we select 300 samples for WikiHow and 100 samples for RecipeQA.

B.2.1 Standard Performance Benchmark
We collect the human performance via Amazon Mechanical Turk (MTurk). Each MTurk worker is required to read the provided instruction carefully, as shown in Figure 5a, and then perform the task, which is designed to be done in an intuitive drag-and-drop fashion (illustrated in Figure 5b). Each MTurk HIT is designed to have five sets of sequencing tasks followed by a few additional questions, such as the confidence level of the worker when inferring the order, and whether different modalities are helpful in a particular task. For each unique sample in the selected golden-test-set, we construct three annotation sets, one for each modality version: multimodal, text-only, and image-only. We launch the HITs containing the same sample but with different modalities with a week-long gap to prevent potential memorization if the same worker happens to annotate the exact same data sample. We estimate the time required to complete each of our HITs to be 10-15 minutes, and adjust our pay rate accordingly to $2 or $3 USD depending on the length of the task. This roughly equates to a $12 to $15 USD per hour wage, which is above the local minimum wage for the workers. In total we receive annotated HITs from around 80 workers for WikiHow, and 14 workers for RecipeQA.
In order to ensure annotation quality and filter potential MTurk spammers, we designate a few sets as our qualification rounds for subsequent worker-pool selection. The Pearson correlation between the performance on the qualification samples and the overall HIT performance is 0.6 with p-value < 0.05. Since the correlation is positive and significant, we censor assignments with substantially low overall performance (< 20% on the accuracy metric), and relaunch the HITs containing those samples for a few more rounds to obtain higher-quality annotations.
Finally, since the agreement is sufficiently high (see Section 3.2), we simply compute the human performance using all of the collected annotated orders from all the participating workers, which results in a reasonably high human-performance upper bound for our proposed sequencing task.

B.2.2 Annotating Alternative Orders
We deliberately ask a different set of MTurk workers than those who participated in the standard performance benchmark round to annotate the alternative orders. In total we receive HITs from around 70 workers for WikiHow, and 40 workers for RecipeQA. The monetary rewards and other general settings follow the same procedure as in the standard performance collection. We compute pairwise IAAs for each worker against every other worker, using the method described in Append. Sec. B.3, and then we place a threshold to filter out workers that tend to have too-low IAAs (a likely indicator that a worker is either a spammer or does not understand our task well). As the final IAAs among the selected pool of workers are sufficiently high (see Section 3.2), for each instance we perform a majority vote on the annotated alternative orders to serve as the final multi-references.

B.3.1 Standard Performance
As orders concern not only the positioning of the items but also more complicated relative information among the items in a sequence, we propose to measure the agreement among orders centering around the concept of pairwise relationships. Specifically, we transform an integer sequence order into a one-hot encoded representation of the N(N-1) ordered pairs of relative relations. Consider an example: suppose three items (1, 2, 3) are to be ordered; all the pairwise relations are {12, 13, 21, 23, 31, 32}. The transformed one-hot representation is defined as: R_123 = {12: 1, 13: 1, 21: 0, 23: 1, 31: 0, 32: 0} = {110100}, i.e., R(ij) = 1 iff ij is a valid relatively ordered pair. Similarly, R_231 = {001110}.
Using the aforementioned definition of R, we can compute a Cohen's Kappa inter-annotator agreement score for each pair of annotated orders per instance. The overall score is computed by first taking the average of the pairwise Kappa scores of the annotations for each instance, and then taking the average across the entire dataset.
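The pairwise-relation encoding and the per-pair Kappa score can be sketched as follows (a minimal re-implementation under our reading of R, not the code used in this work):

```python
from itertools import permutations

def encode(order):
    """Map an order, e.g. (2, 3, 1), to the one-hot vector over all
    ordered pairs (i, j) in lexicographic order: R(ij) = 1 iff item i
    appears before item j in the order."""
    pos = {item: idx for idx, item in enumerate(order)}
    return [1 if pos[i] < pos[j] else 0
            for i, j in permutations(sorted(order), 2)]

def cohen_kappa(x, y):
    """Cohen's Kappa for two binary vectors of equal length."""
    n = len(x)
    p_obs = sum(a == b for a, b in zip(x, y)) / n          # observed agreement
    p1x, p1y = sum(x) / n, sum(y) / n                      # marginal rates of 1s
    p_exp = p1x * p1y + (1 - p1x) * (1 - p1y)              # chance agreement
    return 1.0 if p_exp == 1 else (p_obs - p_exp) / (1 - p_exp)
```

For the worked example in the text, `encode((1, 2, 3))` gives 110100 and `encode((2, 3, 1))` gives 001110.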

B.3.2 Alternative Orders
To evaluate the agreements for the alternative orders, we focus on the differences between an order and the ground truth in their transformed representations. We first compute the one-hot difference between an alternative order and the ground truth order. For example, suppose the ground truth order is simply o_g = 123 and an alternative order is o_1 = 132; then R^diff_{o_g,o_1} = |{110100} - {110001}| = {000101}. To focus on the agreements of the differences to the original ground truth, we apply the Kappa score on a pair of orders by retaining the union of the positions where each order differs from the ground truth in its one-hot representation. For example, if o_2 = 213, then R^diff_{o_g,o_2} = |{110100} - {011100}| = {101000}; the differences to the ground truth are thus at positions 4, 6 from o_1 and 1, 3 from o_2, i.e., the union is {1, 3, 4, 6}. Computing the Kappa scores on R^diff_{o_g,o_1} and R^diff_{o_g,o_2} at these positions leads to computing the scores on the lists {0011} and {1100}.

Algorithm 1 Alternative Order IAA Per Instance
Require: {A_n}_{n=1}^{N}: a list of annotation series, where A_n = {a_{n,k}}_{k=1}^{K_n} denotes the K_n orders annotated by the n-th worker for an instance.
Require: f(x, y): IAA scoring function.
1: Initialize S: empty score list
2: for i = 1 to N do
3:   for j = i + 1 to N do
4:     One-hot encode {a_{i,k}} and {a_{j,k}}
5:     Assume K_i < K_j // otherwise swap
6:     while {a_{i,k}} not empty do
7:       Find the best match according to R^diff
8:     end for
⋮
16:   end for
17: return mean(S)
To compute the agreement between two series of alternative orders from two annotators (the series can have different lengths), we first iteratively find all the best-matching pairs of orders from the two series (each order in a series can only be matched once). When one series contains more orders than the other, the remaining unmatched orders are compared to the ground truth to serve as a penalty. For a particular instance, we take the mean of all the Kappa scores (the best-matching-pair and penalty scores) as the IAA for the two annotators, as detailed in Algorithm 1. The overall IAA is computed similarly to the standard case.
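Under the same reading of R, the difference computation for the worked example can be sketched as follows (an illustrative re-implementation, not the released code):

```python
from itertools import permutations

def encode(order):
    """R: one-hot vector over ordered pairs in lexicographic order;
    R(ij) = 1 iff item i precedes item j."""
    pos = {item: idx for idx, item in enumerate(order)}
    return [1 if pos[i] < pos[j] else 0
            for i, j in permutations(sorted(order), 2)]

def r_diff(gt, alt):
    """R^diff: element-wise absolute difference of the two encodings."""
    return [abs(a - b) for a, b in zip(encode(gt), encode(alt))]

# Worked example: o_g = 123, o_1 = 132, o_2 = 213.
d1 = r_diff((1, 2, 3), (1, 3, 2))
d2 = r_diff((1, 2, 3), (2, 1, 3))
# Union of positions where either order differs from the ground truth
# (0-based indices here).
union = [i for i, (a, b) in enumerate(zip(d1, d2)) if a or b]
# The Kappa score would then be computed on these two restricted lists.
kappa_lists = ([d1[i] for i in union], [d2[i] for i in union])
```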

B.4 Additional Statistics
Apart from the main sequencing task, in the standard benchmark round we also ask the annotators for their confidence in their predictions and whether multimodality is helpful for deciding the order. We hereby provide two more statistics obtained from the workers: the percentages of confidence levels and of which modality (or modalities) helps for deciding the order.
Modality Helps. Regarding which modality is potentially more helpful, we include the percentages of each answer category in Table 8. It can be noticed that the majority of workers (> 60%) think that multimodal (both modalities) input is helpful, and in the recipe data especially, > 90% of workers indicate the effectiveness of utilizing multimodal inputs.
Confidence Levels. As shown in Table 9, the majority of workers feel at least fairly confident (score of 4) about their predictions, which helps justify the validity of our selection of the golden-test-set.

C.1 Qualitative Inspections
Figure 4 shows a few qualitative examples in different categories. Figure 4a shows that while steps 1 and 3 may seem confusing if looking only at the texts, the images can help decide the proper order, whereas models may fail to grasp such multimodal information in Figure 4b. In Figure 4c we show an example where the multi-reference benefits both humans and the models, although in reality it should be more commonsensical to stir before refrigerating the mixture.

C.2 Image-Only Multi-References
We also provide, in Table 10 for reference, the detailed multi-reference performance breakdown on the image-only modality, using the best-performing model from Table 2 (CLIP).

D More Model Details
Multimodal Model Considerations. Bugliarello et al. (2020) suggest that many V&L models [...] We list the search bounds and numbers of trials in Table 12; all of our models adopt the same search bounds and ranges of trials.

D.3 WikiHow Images
Although the images in WikiHow can often be synthetic or "cartoon-ish", we observe that modern object detectors can still propose meaningful regions, regardless of whether the object-class prediction is sensible. We include some predicted bounding boxes in Figure 6 for reference. Hence, although there may be concerns about suboptimal visual understanding of these images, we believe both our ResNet and CLIP visual encoders can extract reasonably useful features.

E Releases & Codes
The scraped WikiHow dataset will be released upon acceptance, along with clearly stated documentation for its usage. We will also release the code for processing the RecipeQA dataset, particularly for our procedure sequencing task; the original dataset can be obtained from their project website. If permitted by the authors of the BERSON model, we will also release the cleaned code repository, which encompasses the majority of the implementations in this work, upon acceptance. We hope that by sharing the datasets and their essential tools, more interest can be drawn to research on multimodal procedure understanding and its future directions.

Figure 1 :
Figure 1: Multimodal task procedure sequencing: The left column shows unordered instruction steps from the manual How To Make Wood Signs. Each step is a text description and its associated image. Without the complementary information from the visuals, a novice may have difficulty inferring the proper task order. Considering multimodal information, the proper order can be correctly inferred (right column).

Figure 2 :
Figure 2: Sequence-aware pretraining includes: (1) masked language modeling (MLM), (2) image-swapping prediction (ISP/PISP) which requires the model to predict if some images (image-patches) are swapped, and (3) sequential masked region modeling (SMRM) where models are asked to reconstruct masked regions in each image within the input sequence.

Figure 3 :
Figure 3: Top-3 and least-2 categories of human-model performance difference (in PMR): The selected categories have >10 samples.The difference bars on the multimodal model series are compared against the text-only model series.

Figure 5 :
Figure 5: MTurk Annotation User Interface: (a) We ask the annotator to follow the indicated instruction and perform the sequencing task. (b) The annotation task is designed for intuitive drag-and-drop usage, followed by a few additional questions such as confidence level and whether each modality helps. (This example is obtained from the RecipeQA dataset.)

Figure 6 :

Table 1 presents the essential statistics of the two datasets (more details are in Append. Sec. A).

Table 1 :
General statistics of the two datasets: We provide the detailed component counts of the datasets used in this work, including the statistics of tokens and sentences from the instruction steps (lower half of the two tables).

Table 2 :
Golden-test-set performance: Models which take multimodal inputs (for both VisualBERT and CLIP-ViL encoders) consistently outperform the ones that take only unimodal inputs. Our proposed sequence-aware pretraining is shown to be consistently helpful throughout the three modality variants. Humans show a larger performance gain when both modalities of inputs are provided, and are more robust to local ordering, as implied by the smaller gaps between Lq and Lr.

Table 3 :
Model ablation studies: We provide a performance breakdown for incremental combinations of the pretraining objectives, ablated on the best performing models (CLIP and CLIP-ViL) from Table 2 for each dataset and modality.

Table 4 :
Multi-reference performance: († denotes human performance) Our golden-test-set can be decomposed into two subsets: Single, where each instance has only the single originally authored ground truth, and Multi., where each instance features multiple ground truths from alternative orders. For the Multi. subset, two types of performance can be computed: single considers only the originally authored ground truth, and multi computes the multi-reference performance. All denotes the entire test set combining the results from the Single and Multi. subsets. Results are reported on the two main competitors, multimodal and text-only, using the best performing models from Table 2 in each modality. % of instances benefit w. multi-reference indicates the percentage of instances in each multi-reference subset for which humans and the models benefit from the alternative ground-truth orders (an instance counts if its performance improves on any of the metrics).

Table 5 :
Multi-reference subset statistics: We report the count (cnt) of multi-reference instances in each dataset across the three modalities, and their basic statistics.

Table 6 :
Top-5 mean alternative orders by categories: We list the top-5 categories in WikiHow according to the average number of ground-truth references in their multi-reference subsets.

Table 7 :
Top-Level Categories of WikiHow: Number of unique articles in each top-level category of the WikiHow dataset. The categories are sorted in alphabetical order. In total there are 19 top-level categories (the same as this page indicates: https://www.wikihow.com/Special:CategoryListing), and one "others" category for standalone leaf nodes without real linkages to these top-level categories.

Table 8 :
Which modality helps? We compute the percentage of each answer category. In both datasets, the majority of the annotations indicate that both modalities are helpful for deciding the orders.

Table 9 :
Confidence Level Statistics (%): In both datasets, the majority (> 80%) of the annotators indicate a confidence level of at least 4 (fairly confident), which helps justify the validity of the human performance.