Multimedia Generative Script Learning for Task Planning

Goal-oriented generative script learning aims to generate subsequent steps to reach a particular goal, an essential capability for assisting robots or humans in performing stereotypical activities. An important aspect of this process is the ability to capture historical states visually, which provides detailed information that is not covered by text and guides subsequent steps. Therefore, we propose a new task, Multimedia Generative Script Learning, to generate subsequent steps by tracking historical states in both text and vision modalities, and present the first benchmark containing 5,652 tasks and 79,089 multimedia steps. This task is challenging in three aspects: the multimedia challenge of capturing the visual states in images, the induction challenge of performing unseen tasks, and the diversity challenge of covering distinct information in individual steps. We propose to encode visual state changes through a selective multimedia encoder to address the multimedia challenge, transfer knowledge from previously observed tasks using a retrieval-augmented decoder to overcome the induction challenge, and present distinct information at each step by optimizing a diversity-oriented contrastive learning objective. We define metrics to evaluate both generation and inductive quality. Experimental results demonstrate that our approach significantly outperforms strong baselines.


Introduction
Robots rely on understanding the present real-world state and predicting subsequent steps to better assist humans in daily stereotypical tasks such as meal preparation and gardening (Ruth Anita Shirley et al., 2021; Liu et al., 2022). For example, Robohow (Beetz et al., 2016) uses articles from WikiHow to assist robots in everyday tasks in human working and living environments. However, not all daily tasks are well documented. Thus, generating a sequence of steps that lead to a given goal (i.e., goal-oriented generative script learning) (Lyu et al., 2021; Huang et al., 2022; Li et al., 2023; Zhou et al., 2023; Liu et al., 2023) is fundamental to allowing robots to perform unseen tasks by understanding the patterns in previously observed similar tasks.

Figure 1: The upper box shows the task input, including the goal and multimedia step history. Each step contains a text description and an illustrative image. The output is the next step. We retrieve historically relevant steps from the training corpus.
Despite this, previous goal-oriented generative script learning focuses solely on text (Lyu et al., 2021; Huang et al., 2022), which is commonly affected by reporting bias (Gordon and Van Durme, 2013), as important details may be omitted from the source text. However, such information is often implicitly contained in images. For example, in Figure 1, the image of Step 1 illustrates the items needed to make a bracelet, which is not mentioned in the text but helps predict the action of threading beads as a future step. Existing multimedia script learning work seeks to bridge this cross-media gap, but the task settings are multi-choice selection (Yang et al., 2021b) or ordering (Wu et al., 2022), which require candidate steps as input and are therefore not practical settings for real-life robots.

Figure 2: Architecture overview. We use the example in Figure 1 as the walk-through example.
To address these problems, we propose a new task, Multimedia Generative Script Learning (Figure 1), that requires systems to generate future steps based on the goal and previous steps with visual scenes depicting their states. Specifically, given the goal and previous step history in the form of natural language sentences paired with descriptive images, the model should automatically generate the natural language instruction for the next step. A good script has three hallmarks: (1) Visual-State Trackable: it records the historical visual scenes and recognizes significant changes that impact future steps. We call this the multimedia challenge. To address it, we focus on salient differences in visual scenes and propose a novel selective multimedia encoder. Rather than learning directly from the visual details of each object, we first leverage an image captioner to produce an abstract summary of the image that captures the global interactions among multiple objects. We then introduce a selection gate to focus on the captions and steps closely related to the future step. For instance, the second caption "a child's hand with a measuring tape on it" in Figure 1 can be filtered out by the selection gate because it is not closely related to the future steps.
(2) Inductive: it transfers knowledge from previously observed tasks to similar unseen tasks. We call this the induction challenge. To induce procedural knowledge from previously observed tasks, we propose a retrieval-augmented decoder that obtains relevant steps to guide the subsequent step generation. For example, the future step in Figure 1 closely resembles the scripts used in previously retrieved steps about threading items, thus transferring script knowledge to an unseen task.
(3) Diverse: it displays distinct information at each step. We call this the diversity challenge. Existing pretrained transformer-based language models such as T5 (Raffel et al., 2020), BART (Lewis et al., 2020a), and GPT-2 (Radford et al., 2019) tend to generate repeated or highly similar future steps, as shown in Figure 1. Therefore, we introduce a novel diversity-oriented contrastive learning objective to force all subsequent steps to convey different information. We treat all other steps in the given input, as well as retrieved steps from other tasks that are similar to the given input, as hard negatives.
In addition to traditional generation-based metrics for evaluating task performance, we propose a new multimodal-retrieval based metric to capture cross-modal semantic similarity. While the model design can be applied to any domain of interest, we experiment on two domains, Gardening and Crafts, where task planning has not been well researched. Automatic evaluation shows that our generated step predictions are close to the human-written ground truth. Human evaluation further confirms that our diversity-oriented contrastive learning objective leads to diverse and correct steps.
The contributions are threefold: 1. We propose the first multimedia goal-oriented generative script learning task to record historical steps in both text and images. We also release a new benchmark from WikiHow, featuring 5,652 tasks and 79,089 multimedia steps.
2. We propose a novel approach to produce visually trackable, inductive, and diverse scripts through a selective multimedia encoder, a retrieval-augmented decoder, and a diversity-oriented contrastive learning objective.
3. We propose a new multimodal-retrieval based metric to evaluate the cross-modal semantic similarity and the inductive ability by checking factual correctness.

Problem Formulation
We propose a new multimedia generative script learning task: given an activity goal $G$, an optional subgoal $M$ that specifies the concrete needs, and the previous multimedia step history $H_n = \{(S_1, V_1), \ldots, (S_n, V_n)\}$ with length $n$, a model is expected to predict the next possible step $S_{n+1}$, where $S_i$ is a text sequence and $V_i$ is an image.
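For concreteness, a single task instance in this formulation could be represented with a minimal data structure such as the following sketch; the field names are illustrative rather than taken from our released code.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Step:
    text: str        # step description S_i
    image_path: str  # illustrative image V_i

@dataclass
class ScriptInstance:
    goal: str               # activity goal G
    subgoal: Optional[str]  # optional subgoal M
    history: List[Step]     # multimedia step history H_n = [(S_1, V_1), ..., (S_n, V_n)]
    next_step: str          # target output S_{n+1}
```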

Dataset Collection
Using articles from the Gardening and Crafts categories as case studies, we create a new dataset based on the English WikiHow dump (2021/05). There are typically three levels of hierarchy in a WikiHow article: goals, which describe the overall task; subgoals, which represent the intermediate processes to accomplish a goal; and steps, which are the specific actions to complete a subgoal. For each WikiHow article, we collect step-image pairs as well as their goals and methods; we only keep steps that contain both images and text. We split the dataset based on task categories, so the validation and test sets contain tasks not included in the training set. Table 1 shows the detailed data statistics.

Model Architecture
The overall framework is illustrated in Figure 2. Given the activity goal $G$, optional subgoal $M$, and multimedia step history $H_n$, we first use an image captioner to map each input image into a precise caption and produce the caption-enhanced step history $\hat{H}_n$. Then we propose a selective multimedia encoder that extends the BART encoder with a gated fusion layer to learn contextualized representations for the step history. After that, a retrieval module retrieves historically relevant steps from the training corpus and encodes them with a retrieved step encoder. Finally, we introduce a retrieval-augmented decoder, which enhances the BART decoder with a retrieval gate fusion layer to fuse the representations of the input step history and retrieved steps to generate the next step. The entire model is trained with our proposed diversity-oriented contrastive loss and a cross-entropy loss.

Selective Multimedia Encoder
Image Encoding Compared to step descriptions, which focus more on the actions, captions provide more information about the visual environment and objects, such as the beads in Step 1 of Figure 2. Because we are more concerned with the overall semantics of the salient objects in the image than with the details of every object, we adopt an image captioner to encode visual features and track visual state changes. For instance, while multiple objects are present in Step 3 of Figure 1, the finger can be ignored because it does not represent the key information conveyed by the image. Specifically, we use the state-of-the-art image captioner BLIP (Li et al., 2022), pretrained on a large-scale vision-and-language corpus with 129M images, to generate a caption $C_i$ for each image $V_i$ in the input step history $H_n$. We thus obtain the caption-enhanced step history $\hat{H}_n = \{(S_1, C_1), \ldots, (S_n, C_n)\}$, where $C_i$ is the caption of image $V_i$ in step $i$.

Selective Multimedia Encoding To help the encoder capture the activity goal and subgoal information, we concatenate the goal $G$ and the optional subgoal $M$ to serve as the first sequence in the history, $X_0 = [G, M]$. For the subsequent steps in the history, we treat each step and caption as separate sequences, $X_{2i-1} = S_i$ and $X_{2i} = C_i$. To summarize the step history, we prepend a learnable [CLS] token to the sequence as a contextualized vector. The entire text sequence is then represented as $X = \{[\mathrm{CLS}], X_0, X_1, \ldots, X_{2n}\}$. We pass $X$ into a BART encoder to obtain the contextualized hidden representations: $h_0$ for the [CLS] token and $H_{X_j}$ for each sequence $X_j$, where $L_{X_j}$ is the length of $X_j$.
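As a rough illustration of the image-to-caption step, the sketch below uses the publicly available BLIP captioning checkpoint on HuggingFace; the checkpoint name and generation settings are assumptions, not the exact configuration used in our experiments.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumed checkpoint; the paper uses the BLIP captioner of Li et al. (2022).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_image(image_path: str) -> str:
    """Map a step image V_i to a caption C_i summarizing its salient objects."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```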
Since the input sequence contains steps or captions that are not directly relevant to the future step, we need to mask those sentences based on the step/caption representations. For instance, in Figure 2, the step description for Step 1 is vague and needs to be masked. We treat the representation of the [CLS] token, $h_0$, as the contextualized representation of the entire step history and use it to compute a mask that filters out irrelevant step/caption information. Specifically, we use $h_0$ as the query and $H_{X_j}$ as both the key and value in Multi-Headed Attention (MultiHead) (Vaswani et al., 2017) for each sequence's hidden states: $\hat{h}_{X_j} = \mathrm{MultiHead}(h_0, H_{X_j}, H_{X_j})$, where $\hat{h}_{X_j}$ is the weighted representation of text sequence $X_j$. Then, for each sequence $X_j$, we compute the mask probability $\alpha_j = \sigma(\mathbf{W}_\alpha \hat{h}_{X_j})$, where $\mathbf{W}_\alpha$ is a learnable parameter. Following Sengupta et al. (2021), we update the hidden states for each sequence $X_j$ as $\tilde{H}_{X_j} = \alpha_j \cdot \mathrm{emb}([\mathrm{MASK}]) + (1 - \alpha_j) H_{X_j}$.
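A minimal PyTorch sketch of this sentence-level selection gate is given below; the module and parameter names (e.g., SelectionGate, w_alpha) are illustrative and follow the description above rather than a released implementation.

```python
import torch
import torch.nn as nn

class SelectionGate(nn.Module):
    """Sentence-level gate: scores one step/caption sequence against the [CLS]
    summary and soft-masks sequences judged irrelevant to the future step."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.w_alpha = nn.Linear(d_model, 1)

    def forward(self, h_cls, seq_states, mask_emb):
        # h_cls:      (B, 1, d)   [CLS] summary of the whole step history
        # seq_states: (B, L_j, d) encoder states of one step or caption X_j
        # mask_emb:   (d,)        embedding of the [MASK] token
        pooled, _ = self.attn(h_cls, seq_states, seq_states)  # weighted repr. of X_j
        alpha = torch.sigmoid(self.w_alpha(pooled))            # (B, 1, 1) mask prob.
        # Interpolate between the [MASK] embedding and the original states.
        return alpha * mask_emb + (1.0 - alpha) * seq_states
```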

Step Retrieval Augmentation
Historically Relevant Step Retrieval In addition to the caption-enhanced step history $\hat{H}_n$, we retrieve historically relevant steps $R_{n+1} = \{R_1, \ldots, R_k\}$ from the training tasks, where $k$ is the number of retrieved relevant steps. We first use SentenceBERT (Reimers and Gurevych, 2019) to encode all steps. We then retrieve the $k$ steps from the training corpus whose SentenceBERT representations have the highest cosine similarity to the previous step $S_n$; we use the previous step $S_n$ rather than the entire history because it is the most temporally correlated with the next step. Finally, we take the immediate next step of each of those $k$ steps as the potential relevant steps $R_{n+1}$. For instance, because Step 5 in Figure 2 is similar to "pull the thread out" in the training corpus, we choose its immediate next step, "thread the bobbin", as a historically relevant step.
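The retrieval step can be sketched with the sentence-transformers library as follows; the checkpoint name and helper-function signature are assumptions for illustration only.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")  # any SBERT checkpoint works here

def retrieve_relevant_steps(last_step, train_steps, train_next_steps, k=5):
    """train_steps[i] is a step from the training corpus and train_next_steps[i]
    is its immediate next step within the same task. Returns R_{n+1}."""
    corpus_emb = encoder.encode(train_steps, convert_to_tensor=True)
    query_emb = encoder.encode(last_step, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=k)[0]
    # Keep the *next* step of each retrieved neighbour, not the neighbour itself.
    return [train_next_steps[h["corpus_id"]] for h in hits]
```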

Retrieved Step Encoder For the historically relevant steps $R_{n+1} = \{R_1, \ldots, R_k\}$, we apply the BART encoder to obtain the hidden states $H_{R_i}$ for each retrieved step $R_i$. As in the selective multimedia encoder, we use $h_0$ as the query and $H_{R_i}$ as both the key and value to compute multi-headed attention over each sequence's hidden states: $\hat{h}_{R_i} = \mathrm{MultiHead}(h_0, H_{R_i}, H_{R_i})$, where $\hat{h}_{R_i}$ is the weighted representation of step sequence $R_i$. We then compute the mask probability $\beta_i = \sigma(\mathbf{W}_\beta \hat{h}_{R_i})$, where $\mathbf{W}_\beta$ is a learnable parameter, and update the hidden states for each sequence $R_i$ as $\tilde{H}_{R_i} = \beta_i \cdot \mathrm{emb}([\mathrm{MASK}]) + (1 - \beta_i) H_{R_i}$. The final hidden state sequence is $\tilde{H}_R = [\tilde{H}_{R_1}; \ldots; \tilde{H}_{R_k}]$.

Retrieval-Augmented Decoder
In the decoder, we compute the probability $P(s_q \mid s_{<q}, \hat{H}, G, M)$ for the $q$-th token $s_q \in S_{n+1}$. Our retrieval-augmented decoder is similar to Liu et al. (2021) and aims to capture historically relevant steps related to the next step based on the previous decoder hidden states. Given $z_q^l$, the hidden state of $s_q$ in layer $l$, we first use multi-head cross-attention to fuse the hidden states from the retrieved steps $\tilde{H}_R$: $\hat{z}_q^l = \mathrm{MultiHead}(z_q^l, \tilde{H}_R, \tilde{H}_R)$. We also apply a gating mechanism to control the knowledge from the retrieved steps and the previous hidden states: $\gamma_q = \sigma(\mathbf{W}_\gamma [z_q^l; \hat{z}_q^l])$ and $\tilde{z}_q^l = \mathrm{LN}(\gamma_q \cdot \hat{z}_q^l + (1 - \gamma_q) \cdot z_q^l)$, where $\mathbf{W}_\gamma$ is a learnable parameter and $\mathrm{LN}(\cdot)$ is the layer norm function. Finally, the fused hidden states in the top layer are used to compute the generation probability. We supervise the next-step generation using the standard cross-entropy loss: $\mathcal{L}_{gen} = -\sum_{q} \log P(s_q \mid s_{<q}, \hat{H}, G, M)$.
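The gated fusion inside a decoder layer could be sketched as follows; since the gating equation is only summarized above, this is an assumed instantiation rather than the definitive implementation.

```python
import torch
import torch.nn as nn

class RetrievalFusionLayer(nn.Module):
    """Sketch of the retrieval gate fusion inside one decoder layer (our naming).
    It cross-attends from decoder states to retrieved-step states and gates how
    much retrieved knowledge flows into the decoder."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.w_gamma = nn.Linear(2 * d_model, 1)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, z, h_retrieved):
        # z:           (B, T, d)   decoder hidden states for the target tokens
        # h_retrieved: (B, L_R, d) concatenated hidden states of retrieved steps
        z_hat, _ = self.cross_attn(z, h_retrieved, h_retrieved)          # fused states
        gamma = torch.sigmoid(self.w_gamma(torch.cat([z, z_hat], -1)))   # (B, T, 1)
        return self.norm(gamma * z_hat + (1.0 - gamma) * z)
```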

Diversity-Oriented Contrastive Learning
In our experiments, we observe that the model tends to keep generating similar future steps in a row given the beginning steps as input, or simply paraphrases the input steps. Therefore, we propose a contrastive learning-based loss to encourage the model to return diverse step predictions.

Negative Sampling Sequence-to-sequence models suffer from the "exposure bias" problem (Dhingra et al., 2016) because of teacher forcing. A contrastive loss provides an additional sequence-level loss that can help models increase the diversity of the output steps. We adopt two types of negative sampling strategies to discourage the model from paraphrasing the previous step as the future step: self-negatives (Wang et al., 2022), where we consider the input steps as negative samples, and retrieved negatives, where we consider retrieved steps from the training corpus that are similar to the input step as negative samples. For example, in Figure 1, the goals and steps from the step history serve as the self-negatives. Given the last step, "cut the thread", we retrieve similar steps from the training set as retrieved negatives, which include "cut your thread", "cut off the extra thread", etc.
Diversity-Oriented Contrastive Loss Since the model needs to distinguish between the ground truth and those negative samples, we design a novel diversity-oriented contrastive loss. Specifically, given an input sequence $(\hat{H}, G, M)$, the ground-truth next step $S_{n+1}$, and a set of $K$ negative samples $\{S_{n+1}^1, S_{n+1}^2, \ldots, S_{n+1}^K\}$, we aim to maximize the probability of classifying the positive sample correctly with the InfoNCE loss (Oord et al., 2018):
$$\mathcal{L}_{cl} = -\log \frac{\exp\big(\mathrm{Avg}(\tilde{H}^{+})\mathbf{W}_y/\tau\big)}{\exp\big(\mathrm{Avg}(\tilde{H}^{+})\mathbf{W}_y/\tau\big) + \sum_{k=1}^{K}\exp\big(\mathrm{Avg}(\tilde{H}^{-}_{k})\mathbf{W}_y/\tau\big)},$$
where $\tilde{H}^{+}$ and $\tilde{H}^{-}_{k}$ are the decoder hidden states for the positive and the $k$-th negative sample, $\mathbf{W}_y$ is a learnable parameter, $\tau$ is the temperature, and $\mathrm{Avg}(\cdot)$ denotes average pooling.
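A possible sketch of this loss, treating the positive and the K negatives as a (1+K)-way classification problem, is given below; the pooling and scoring head follow the description above but are otherwise assumptions.

```python
import torch
import torch.nn.functional as F

def diversity_contrastive_loss(h_pos, h_negs, w_y, tau=1.0):
    """InfoNCE-style loss over decoder hidden states (a sketch).
    h_pos:  (B, T, d)    decoder states when feeding the ground-truth next step
    h_negs: (B, K, T, d) decoder states for the K negative samples
    w_y:    nn.Linear(d, 1) learnable scoring head
    """
    pos_score = w_y(h_pos.mean(dim=1)).squeeze(-1) / tau       # (B,)
    neg_score = w_y(h_negs.mean(dim=2)).squeeze(-1) / tau      # (B, K)
    logits = torch.cat([pos_score.unsqueeze(1), neg_score], 1)  # (B, 1 + K)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)  # maximizes the positive's probability
```

In training, this term would be added to the generation loss with the weight described in the next subsection.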

Training Objective
We jointly optimize the cross-entropy loss and our proposed diversity-oriented contrastive loss: $\mathcal{L} = \mathcal{L}_{gen} + \lambda \mathcal{L}_{cl}$, where $\lambda$ is a hyperparameter that controls the weight of the contrastive loss.

Evaluation Metrics
Generation Quality Evaluation Following common practice in text generation, we first evaluate our model with BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Denkowski and Lavie, 2014) scores to examine the content overlap between generated steps and the ground truth.

Inductive Quality Evaluation In order to determine whether the inferred subsequent steps are factually correct, we further evaluate the models with BARTScore (Yuan et al., 2021) and the semantic similarity score (Thakur et al., 2021). The semantic similarity score uses a cross-encoder pretrained on STSBenchmark (Cer et al., 2017) to calculate the semantic similarity between two sentences. In addition to evaluating whether the generated step matches the next step, we also check whether the generated step matches any subsequent step; this lets the model earn credit if it generates a step that appears later in the future. We propose a multimodal-retrieval based metric: for each generated step, we use it as a query to search all corresponding step-image pairs under the same subgoal/goal in the test set, and compute HIT@1 for results that fall into the ground-truth future step-image pairs. Similar to Section 4.3, we use SBERT (Reimers and Gurevych, 2019) to rank the most similar steps under the same subgoal to obtain Text@1 (T@1). To compute Image@1 (I@1), we use CLIP (Radford et al., 2021) to rank the most similar images under the same subgoal. If the top-1 retrieval result appears among the subsequent steps, we count it as a HIT. The retrieval-based metric captures normalized semantic similarity with respect to all related steps under a given subgoal, and the CLIP-based retrieval metric additionally enables evaluation of cross-modality semantic similarity. Additional details of the evaluation setup are in Appendix C.
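The Text@1 variant of this metric can be sketched as follows (Image@1 is analogous but ranks the step images with CLIP); the SBERT checkpoint name matches the one listed in Appendix C, while the helper function itself is illustrative.

```python
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("all-mpnet-base-v2")

def text_hit_at_1(generated_step, candidate_steps, future_steps):
    """Text@1: retrieve the most similar step under the same subgoal and count a
    HIT if it is one of the ground-truth future steps.
    candidate_steps: all steps under the same subgoal in the test set
    future_steps:    the ground-truth subsequent steps for this instance
    """
    cand_emb = sbert.encode(candidate_steps, convert_to_tensor=True)
    query_emb = sbert.encode(generated_step, convert_to_tensor=True)
    best = util.semantic_search(query_emb, cand_emb, top_k=1)[0][0]["corpus_id"]
    return float(candidate_steps[best] in future_steps)
```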

Baselines

We compare with retrieval baselines, including a naive retrieval baseline that directly uses the retrieved historically relevant sentences as discussed in Section 4.3, and retrieval BART, which takes in the concatenation of the retrieved historically relevant sentences with the original text input. We also include multimodal generation baselines that take image embeddings instead of captions as input, which is equivalent to CLIP-BART (Sung et al., 2022).

Table 3: Results with automatic evaluation on next step prediction for the gardening domain (%). B-n denotes the BLEU-n score, R-L denotes the ROUGE-L score, and Semantic denotes the semantic similarity score.

Table 5: Percent (%) of n-grams in step history which appear in human or system steps.

Automatic Evaluation
As shown in Tables 3 and 4, our model outperforms the baselines. Since our task is open-ended and we test on unseen activities, our generated sentences usually contain paraphrases. Therefore, the BLEU scores, which rely on exact word n-gram matches (Goldberg, 2018), are not high. In particular, because our ground truth has an average length of only 11 and contains fewer 4-grams than the text in other tasks, our BLEU-4 is lower than in other text generation tasks. The substantial gap between CLIP-BART and BART or BART with captions indicates that captions usually carry more specific information than images, and that current multimodal encoders still cannot perfectly embed text and images into the same semantic space. Meanwhile, the low performance of the retrieval baselines shows that simple retrieval methods are insufficient to predict accurate next steps.

Table 6: Self-BLEU (%) for human or system steps.
Among our model variants, adding selective encoding leads to a further performance increase, showing that selective encoding helps the model focus on the content in the step history that is most related to future steps. The superior BARTScore and semantic similarity of the retrieval-augmented model indicate the effectiveness of the guidance from historically relevant steps. Our contrastive learning model achieves larger gains over the baselines on BLEU and METEOR, suggesting that the contrastive loss helps the model generate results closer to the ground truth.

Automatic Evaluation with Future Steps We evaluate whether the predicted step is related to any future step. Our contrastive learning model outperforms the other ablations significantly on text retrieval for the Gardening domain, as shown in Table 2. These results imply that the contrastive learning objective encourages the model to generate more informative future steps. The decrease in n-gram overlap between the input step history and the step predictions (Table 5) suggests that the contrastive learning objective also decreases the model's tendency to paraphrase. Interestingly, performance decreases when adding retrieval augmentation, because the retrieval module introduces additional information related to the step history, which makes the model generate results similar to previous steps (Table 5).

Automatic Evaluation on Diversity To evaluate the diversity of the generated steps on the test sets, we employ two diversity metrics: self-BLEU (Zhu et al., 2018) (Table 6) and unique n-grams (Fedus et al., 2018) (Table 7). Self-BLEU evaluates whether a model produces similar n-grams in different samples by measuring the similarity between one sentence and the rest of the test set. The retrieval model achieves the best results for the Gardening domain because it acquires additional knowledge from the retrieved steps and thus diversifies the output. The contrastive learning model achieves the best self-BLEU for 3- and 4-grams in the Crafts domain, implying our model's effectiveness. Unique n-grams measure the percentage of distinct n-grams, accounting for the repetition of n-grams both within a generated step and across samples. The contrastive learning model achieves the highest distinct scores for 3- and 4-grams in both domains, indicating the effectiveness of our diversity-oriented contrastive loss in generating more diverse steps.

Human Evaluation

Since script learning is an open-ended task for which automatic metrics inherently struggle to measure the correctness of generated scripts (Huang et al., 2022), we further conduct a human evaluation. We hire four proficient English speakers as annotators to independently rank the generation results from 1 (best) to 5 (worst) on: (1) next step correctness, which measures whether the generated results match the next step; (2) future steps correctness, which measures whether the generated results match any of the future steps; (3) diversity, which measures the diversity of the generated results under the same subgoal; and (4) executability, which checks whether the generated results repeat or conflict with the step history. We randomly select ten subgoals, comprising 41 and 44 generated steps from the test sets for Gardening and Crafts, respectively.

The human evaluation results are shown in Table 8; the Krippendorff-α inter-annotator agreement scores (Krippendorff, 2018) and detailed guidelines are in Appendix K. Our contrastive learning model performs best over all metrics on both datasets. As we add each component of our model, we observe a consistent improvement in correctness with respect to the ground truth. However, we also observe that the scores for selective encoding decrease, because the output space with selective encoding is more constrained than that of the BART baseline, and the length of our generated sequences is not very long.

Discussions
Impact of Selective Multimedia Encoder The caption input helps the model better understand generic step descriptions. For example, given the activity "cure azaleas of leaf gall", the step text only shows a generic instruction: "rule out other diseases". However, the BLIP captioner generates "a green leaf with white dots on it", which helps the model generate "remove the leaf gall from the shrub" instead of "keep your shrub healthy". Furthermore, in Figure 1, the finger object is absent from caption 3, indicating that the caption model is able to eliminate extraneous information from the image. The selective gate can filter out steps that are not directly related to the current subgoal. For example, in Figure 1, our model successfully predicts a low masking weight of 0.049324 for the step "cut the thread", while assigning a much higher masking weight of 0.134498 to its uninformative caption "a pair of scissors and a measuring tape". These results imply that the selective gate successfully guides the model to focus on the related information.

Impact of Retrieval Augmentation The retrieved steps provide relevant knowledge from similar tasks: given the subgoal "finding or growing roses", because the retrieved sentence mentions "fertilizer" and "mulch", the model successfully generates "fertilize your roses". The model also benefits from retrieval augmentation by analogy, e.g., it generates "know when to harvest" given the retrieved step "plant the bulbs when you get them".

Impact of Contrastive Learning In addition to the improvement in diversity shown in the previous section, we observe that contrastive learning helps the model generate results closer to the ground truth than the other baselines. For example, it generates "pick creeping charlie plants from the ground", which is similar to the ground truth "pick your creeping charlie leaves". The addition of contrastive learning also helps our model generate instructions with more details than other baselines, stating "place the plant in the hole and cover it with soil" instead of "place the plant in the hole".

Related Work

Existing multimedia script learning work casts the task as multi-choice selection (Yang et al., 2021b) or step ordering (Wu et al., 2022). Despite promising results, their performance heavily relies on the given candidates, making them difficult to generalize to unseen activities. The second category is text-based generative script learning (Tandon et al., 2020; Lyu et al., 2021; Huang et al., 2022; Li et al., 2020, 2021; Jin et al., 2022; Sancheti and Rudinger, 2022). However, this is the first work to provide multimedia goal-oriented generative script learning along with a new multimodal-retrieval based metric. Different from Sener and Yao (2019), which uses a video to generate the next step, our new task uses step image-text pairs as input. Unlike previous multimedia script learning frameworks that use a multimedia encoder to capture visual and textual information, we use a captioner to convert images into captions summarizing the important objects in the images. The GOSC dataset (Lyu et al., 2021) contains the steps of daily stereotypical tasks, but most of the steps (52.6%) in this dataset are unordered, making it infeasible to evaluate next-step prediction. Consequently, we adapted the best model in GOSC, mT5 (Xue et al., 2021), to our setting, i.e., its monolingual version T5, and used it as an equivalent baseline for comparison with the state of the art.
To handle irrelevant sentences in the input, instead of using a token-level gating mechanism that only depends on the token itself (Sengupta et al., 2021), we introduce a sentence (step/caption) level gating mechanism whose gates depend on the global context and weighted sentence representations. Our work is also related to retrieval-augmented text generation models (Wang et al., 2019; Lewis et al., 2020b; Liu et al., 2021). However, instead of retrieving knowledge from an external corpus, we use steps from similar tasks in the training data to guide the generation process. Moreover, we introduce a new contrastive learning loss to increase diversity. Previous contrastive learning-based text generation methods usually use negative samples constructed by sequence manipulation (Cao and Wang, 2021; Hu et al., 2022) or perturbation (Lee et al., 2021). Inspired by Wang et al. (2022), which uses self-negatives for knowledge graph completion, and by the observation that the generation output tends to repeat the input, we extend self-negatives to sequence-to-sequence contrastive learning. We also retrieve similar steps from the training set as additional hard negatives.

Conclusion
We propose a novel Multimedia Generative Script Learning task with the first benchmark featuring step and descriptive image pairs to generate subsequent steps given historical states in both text and vision modalities. Moreover, we build a new script learning framework consisting of a selective multimedia encoder, a retrieval-augmented decoder, and a diversity-oriented contrastive learning objective to generate the next steps. Furthermore, we define a new multimodal-retrieval based metric which can be used for multimedia script learning tasks. Automatic and human evaluation results demonstrate consistent performance improvements.

Limitations of Data Collection
Regarding data collection, we crawled the English WikiHow website from January 2021 to May 2021. The number of available activities is limited by the data we crawled from WikiHow. We currently only choose the Gardening and Crafts categories as case studies. Because we focus on multimedia image-step pairs, we remove steps that are not attached to any illustrative images. We also observe that a small portion of activities in the dataset do not follow chronological order.
Since our task focuses on daily stereotypical tasks, which usually require the model to understand the visual environment, the model design can be directly applied to support other domains, such as steps in cooking videos. In addition, our model can also adapt to scenarios without visual images, because its performance only decreases slightly if no caption is provided. We plan to expand our model to other categories written in other languages.

Limitations of System Performance
The model sometimes generates incorrect nouns because of spurious surface patterns (e.g., "refrigerate the slane for up to 1 year" instead of "refrigerate the purslane for up to 1 year"). In addition, our model sometimes generates generic step descriptions because of insufficient input information; e.g., given the last step "lay the t-shirt out on a clean, flat surface.", the model generates "cut the shirt out", which is vague compared to the ground truth "carefully cut around the sleeve". Moreover, the pretrained model might focus more on language modeling than on the inherent logic: for the activity "make paint can planters", after "removing the label" from the paint can, the BART+CAP model generates "read the label". There is also still a small chance that the model generates the same output for several similar inputs.
Because we rely on image captions and retrieval results for step prediction, the upper bound of our generation quality is limited by the performance of the image captioning and sentence retrieval modules. Our framework also does not yet address imbalanced topics in the dataset; for example, the gardening domain contains more activities about trees than about other plants. Because multimedia generative script learning is a new task, we cannot compare our model with other established state-of-the-art models. Moreover, because WikiHow is a crowd-sourced website, some everyday activities might have better human annotations than others. We plan to include a fine-grained human-written step prediction as an upper bound to address this issue.

Limitations of Evaluation
The automatic metrics we chose, including BLEU, rely on n-gram overlap with the ground truth and may penalize valid paraphrases. Other metrics, such as the semantic similarity and multimodal-retrieval based metrics, are based on pretrained models, including Augmented SBERT (Thakur et al., 2021), SentenceBERT (Reimers and Gurevych, 2019), and CLIP (Radford et al., 2021). These metrics might not align with human judgment and might be biased toward their pretraining datasets. While we complement them with human evaluation, we only focus on relevance to the ground truth and diversity; although we found that fluency is not an issue, we may still not cover all aspects of the generation results.

Ethics and Broader Impact
The type of multimedia script learning framework we have designed in this paper is limited to WikiHow articles, and it might not be applicable to other scenarios.

Usage Requirement
Our multimedia script learning framework provides investigative leads for multimedia script prediction. It is not intended to be used for any activity directly involving human subjects. Instead, our system aims to generate step predictions for unseen activities similar to those in the training set. Accordingly, domain experts might use this tool as an assistant to write more constructive instructional scripts that would be too time-consuming for a human to create from scratch. Experts can also use this system to improve written instructions by adding missing steps. However, our system does not perform fact-checking or incorporate any external knowledge, which we leave as future work. Any study in which human subjects follow instructions generated by our system should first be approved by an IRB.

Data Collection
We collect data by crawling the raw official English WikiHow website, which is under the Attribution-NonCommercial-ShareAlike 3.0 Creative Commons License. We ensure that our data collection procedure follows the Terms of Use located at https://www.wikihow.com/wikiHow:Terms-of-Use. Therefore, our dataset can only be used for non-commercial purposes. As mentioned in Section 6.3, we perform a human evaluation. All annotators involved in the human evaluation are voluntary participants and receive a fair wage.

A Hyperparameters
Our model is built on the Huggingface framework (Wolf et al., 2020). We choose the top 5 retrieved historically relevant steps as input for our retrieval model. We choose 5 negative samples for each step during contrastive learning for the gardening domain: 4 self-negative samples, including steps and captions, are randomly chosen from the title, method, and step history input, and the remaining retrieved negative sample is randomly chosen from the top-20 most similar steps retrieved from the training set based on the last step. For the crafts domain, we choose 5 self-negative samples and 5 retrieved negative samples. We set $\tau$ to 1 for the contrastive loss and $\lambda$ to 0.5 based on validation performance. We optimize our model with AdamW (Loshchilov and Hutter, 2019).

B Training details
We use BART-base from Huggingface (Wolf et al., 2020) for our method and the baselines. We normalize all input sentences to lower case. We add special tokens to the BART-base model, including <title>, <method>, <step>, <caption>, <template>, and <cls>. We prepend <title> to the goal, <method> to the subgoal, <step> to each text step, <caption> to each step caption, <template> to each retrieved step, and <cls> to the beginning of the step history input. We truncate each step, caption, goal, and subgoal to 30 tokens and the target step to 40 tokens, and we only keep the closest 10 step-caption pairs. We use BLIP (Li et al., 2022) to generate image captions. We train our models at full precision on a GPU with 48G of memory. We choose the best model based on the validation BLEU-4 (Papineni et al., 2002) and ROUGE (Lin, 2004) scores. The best validation scores for our contrastive learning model are BLEU-4 of 2.81 and ROUGE-L of 15.24 for the gardening domain, and BLEU-4 of 4.85 and ROUGE-L of 20.25 for the crafts domain. The average training time for each model is 2 to 4 hours. Table 9 shows the number of parameters for each model.
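The special-token setup described above corresponds to the standard HuggingFace pattern sketched below (shown only as an illustration of the procedure, not the exact training script).

```python
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Markers prepended to the goal, subgoal, steps, captions, and retrieved steps.
special_tokens = ["<title>", "<method>", "<step>", "<caption>", "<template>", "<cls>"]
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})
model.resize_token_embeddings(len(tokenizer))  # make room for the new tokens
```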

C Evaluation Metrics
We use BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Denkowski and Lavie, 2014) from the Microsoft COCO Caption Evaluation package, and the official implementation of BARTScore (Yuan et al., 2021). We use cross-encoder/stsb-roberta-large, which performs best on STSBenchmark (Cer et al., 2017), to compute the semantic similarity score from Augmented SBERT (Thakur et al., 2021). For the multimodal-retrieval based metric, we use the best sentence embedding model, all-mpnet-base-v2, from SentenceBERT (Reimers and Gurevych, 2019) for text retrieval, and the best language-image pretraining model, ViT-L/14@336px, from CLIP (Radford et al., 2021) for image retrieval. Specifically, we compute the CLIP similarity between the image embedding and the sentence embedding of the target step to retrieve images. All results are based on a single run. We have opted not to include a human performance baseline in our evaluation. This decision was made due to the inherent challenges of assessing human performance in generative script learning, which requires annotators to possess domain knowledge in order to predict the next steps accurately. Moreover, different tasks may require different levels of expertise, experience, or background knowledge, making it difficult to establish a consistent baseline for human performance evaluation.
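For example, the semantic similarity score can be computed with the named cross-encoder through the sentence-transformers API, as in this short sketch.

```python
from sentence_transformers import CrossEncoder

# STSBenchmark cross-encoder named above.
sts_model = CrossEncoder("cross-encoder/stsb-roberta-large")

def semantic_similarity(generated_step: str, ground_truth_step: str) -> float:
    # Returns a scalar STS score for the (prediction, reference) pair.
    return float(sts_model.predict([(generated_step, ground_truth_step)])[0])
```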

D Additional Ablation Study
We conducted further ablation experiments, the results of which are presented in Table 11. Our findings show that all ablated models performed worse than our proposed model.

E Impact of Step History Length

In Figure 3a and Figure 3b, we show the averaged BARTScore and semantic similarity scores of our contrastive learning models on the next step prediction task over different step history lengths. In both figures, we observe that the results with eight step-caption pairs obtain the highest scores. We analyze the reasons as follows. For instances that contain fewer than eight history steps, increasing the step history introduces more information than noise from the step text and corresponding captions. However, as the step history grows, the additional step-caption pairs introduce more noise than information relevant to the future step. Empirically, a history length of eight achieves an optimal balance between noise and relevant information. Another potential reason is related to the number of instances: in Table 10, we see a clear decline in the number of instances with long histories because of our dataset construction strategy, so the model cannot generalize over long history inputs.

F Dataset Collections
We crawled the English WikiHow website from January 2021 to May 2021. We extract all articles from the crawled website dump in the Gardening and Crafts categories; each article contains a unique activity. We use BeautifulSoup (Richardson, 2007) to parse each article and obtain JSON files, where each JSON file contains one activity. For each activity, we remove steps without paired images or whose images do not exist in the dump. Then, we use a regular expression to remove the URLs in the steps. We remove steps that are too short (fewer than two words) or contain no values. Finally, we remove activities whose subgoals contain only one step.
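A simplified sketch of this cleanup pipeline is shown below; the CSS selectors and field names are assumptions, since the exact parsing code depends on the WikiHow page layout.

```python
import re
from bs4 import BeautifulSoup

def parse_article(html: str) -> dict:
    """Sketch of the cleanup described above; the ".step" selector is an
    assumed placeholder, not the exact selector used to build the dataset."""
    soup = BeautifulSoup(html, "html.parser")
    goal = soup.find("h1").get_text(strip=True)
    steps = []
    for step in soup.select(".step"):
        text = re.sub(r"https?://\S+", "", step.get_text(" ", strip=True))
        img = step.find("img")
        # Drop steps without an image or with fewer than two words.
        if img is None or len(text.split()) < 2:
            continue
        steps.append({"text": text.lower(), "image": img.get("src")})
    return {"goal": goal, "steps": steps}
```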

G Parallel Steps
In this paper, we focus on predicting correct orders for sequential step prediction, since by randomly checking 50 subgoals we find that only 18% of the subgoals have one parallel step and 14% contain more than one parallel step. It is more critical to predict correct orders for non-interchangeable steps, such as Steps 4 and 5 in Figure 1. By using generative methods, multiple steps can be predicted with different probabilities, which can support parallel processes. We also propose the multimodal-retrieval based metric, which treats the future steps as a set and checks whether the generated steps fall into the future steps.

Figure 4: The semantic similarity scores (Thakur et al., 2021) between the model predictions and the ground truths versus the semantic similarity scores between the retrieved historically relevant steps and the ground truths in the gardening domain.

H Impact of Historically Relevant Steps
We analyze the relation between the quality of the retrieved historically relevant steps and the quality of the model predictions. The semantic similarity score evaluates the quality of retrieved steps and model predictions, measuring the embedding-space similarity between a given text and the ground-truth next step. Pearson's correlation between the semantic scores of the historically relevant steps and the semantic scores of the model predictions is 0.39 with $p < 0.01$. We also illustrate their relation in Figure 4. The results suggest that the performance of our model is positively correlated with the relevance of the retrieved historical steps.
K Human Evaluation Details

We measure inter-annotator agreement with Krippendorff-α scores (Krippendorff, 2018); the results are in Table 12. Table 13 shows the annotation examples. Because we do not have a virtual environment in which to execute the steps, the inter-annotator agreement on executability is not high.

L Sample Output
Goal: cure azaleas of leaf gall
Step History:
Step 1: identify your shrub as an azalea. (Caption 1: a pink flower with green leaves on a blue background)
Step 2: rule out other diseases.
Historically Relevant Steps: 1: look for signs of pests. 2: give your plants just the right amount of sun. 3: look for insect activity. 5: use cultural control.
Ground Truth (next step): remove infected leaves.
Future Steps: destroy the infected pieces away from the plant.
BART: keep your shrub healthy.
BART+CAP: remove the leaf gall. a person holding a green leaf in their hand.
BART+CAP+ME: remove the leaf gall from the plant.
BART+CAP+ME+RD: remove the leaf gall. a person cutting a plant with scissors.
Our Model: remove the leaf gall from the shrub.
Step prediction results: this example shows that our model benefits from the selective multimedia encoder.

Instructions
(1) Similarity to the next step measures the correctness of the generated results with respect to the next step; (2) similarity to future steps measures whether the generated results are relevant to the future steps; (3) diversity measures the diversity of generated results under the same subgoal; (4) executability checks whether the generated results repeat or conflict with the step history. Please rank the models' outputs from 1 (best) to 5 (worst); ties are allowed if two outputs are the same.
Similarity and executability annotation example. Title: protect garden berries. Subgoal: setting up decoys. Step History: use plastic snakes.

Goal: plant a plant
Subgoal: planting in outdoor soil
Step History:
Step 1: plant your plant in the spring or fall. (Caption 1: a plant that is growing out of the ground)
Step 2: remove the plant from its pot or netting. (Caption 2: a potted plant with a cross on it)
Step 3: inspect and prune damaged roots. (Caption 3: how to cut a plant with pictures wikihow)
Step 4: make a garden bed for flowers and bushes. (Caption 4: a cartoon of a man digging a plant in the ground)
Step 5: dig a hole 2 to 3 times wider than the plant's root ball. (Caption 5: a group of trees with roots in the ground)
Historically Relevant Steps: 1: widen the hole so it's twice the size of the root ball. 2: pull up any grass and weeds in and around the hole.
Ground Truth (next step): deepen the hole so the plant's root crown is at the soil line.
BART: place the plant in the hole.
Our Model: place the plant in the hole and cover it with soil.
Step prediction results: this example shows a case where our model's prediction matches a future step instead of the immediate next step.
