Visual Goal-Step Inference using wikiHow

Understanding what sequence of steps is needed to complete a goal can help artificial intelligence systems reason about human activities. Past work in NLP has examined the task of goal-step inference for text. We introduce the visual analogue. We propose the Visual Goal-Step Inference (VGSI) task, where a model is given a textual goal and must choose which of four images represents a plausible step towards that goal. With a new dataset harvested from wikiHow consisting of 772,277 images representing human actions, we show that our task is challenging for state-of-the-art multimodal models. Moreover, the multimodal representation learned from our data can be effectively transferred to other datasets like HowTo100m, increasing the VGSI accuracy by 15-20%. Our task will facilitate multimodal reasoning about procedural events.


Introduction
Recently, there has been growing attention on the representation of complex events, with renewed interest in script learning and commonsense reasoning (Park and Motahari Nezhad, 2018; Mujtaba and Mahapatra, 2019; Li et al., 2020). One aspect of event representation is the relationship between high-level goals and the steps involved (Zhang et al., 2020b,a). For example, given a goal (e.g., "change a tire"), an intelligent system should be able to infer what steps need to be taken to accomplish the goal (e.g., "place the jack under the car", "raise the jack"). In most work, events are represented as text (Zellers et al., 2018; Coucke et al., 2018; Zhang et al., 2019), although in the real world they can appear in other modalities.
Learning goal-step relations in a multimodal fashion is an interesting challenge since it requires reasoning beyond image captioning. We contend that multimodal event representation learning will have interesting implications for tasks such as schema learning (Li et al., 2020, 2021) to mitigate reporting bias (Gordon and Van Durme, 2013), since steps are often not explicitly mentioned in a body of text. For instance, if a robot is asked to "get a slice of cake," it has to know that it should "take the cake out of the box", "cut a slice", "put it on a plate", and then "take the plate to the user". Such steps are commonsense to people and thus rarely specified explicitly, making them hard to infer from textual data. With multimodal learning, however, we could obtain such details from visual signals. This multimodal goal-step relation could also be used by vision-enabled dialog systems to recognize what task a user is completing and provide helpful recommendations.
We propose a new task called Visual Goal-Step Inference (VGSI): given a textual goal and multiple images representing candidate events, a model must choose one image which constitutes a reasonable step towards the given goal. This means that a model should correctly recognize not only the specific action illustrated in an image (e.g., "turning on the oven", in Figure 1), but also the intent of the action ("baking fish").
We collect data from wikiHow articles, where most steps are illustrated with images. Our VGSI training set is constructed using three sampling strategies to select negative image candidates as distractors. In the format of multiple-choice and image retrieval, we evaluate four state-of-the-art multimodal models, DeViSE (Frome et al., 2013), Similarity Networks (Wang et al., 2018), Triplet Networks (Hoffer and Ailon, 2015), and LXMERT (Tan and Bansal, 2019), and compare them to human performance. We observe that SOTA models designed for caption-based multimodal tasks (Karpathy et al., 2014; Johnson et al., 2016) struggle on VGSI, exhibiting a 40% gap in accuracy from human performance when using a challenging sampling strategy.
One limitation of wikiHow is that most images are drawings rather than photos (which are more typically used in computer vision research). The knowledge learned from wikiHow is nevertheless useful when applied to real photos. We demonstrate this by pre-training a triplet-network on our wikiHow VGSI task and then conducting transfer learning on out-of-domain datasets. Our experiments show that pre-trained models can effectively transfer the goal-step knowledge to task-oriented video datasets, such as COIN (Tang et al., 2019) and Howto100m (Miech et al., 2019). In addition, we design an aggregation model on top of SOTA models which treats wikiHow as a knowledge base that further increases the transfer learning performance (see Appendix C).
We make three key contributions: (1) We propose the VGSI task, which is more challenging than traditional caption-based image-text matching tasks and requires the model to have an intermediate reasoning process about goal-step relations.
(2) To study the VGSI task, we collect a multimodal dataset from wikiHow which contains over 770k images.
(3) Through transfer learning, we show that the knowledge learned from our dataset can be readily applied to out-of-domain datasets, with an accuracy improvement of 15-20% on VGSI.

wikiHow as Multimodal Resource
We use wikiHow as the data source for VGSI because it has been successfully adopted in prior work for procedural learning (Zhou et al., 2019) and intent detection (Zhang et al., 2020a) in the language domain. As shown in Figure 2, each wikiHow article contains a high-level goal and one or more different methods to achieve it. Each method then includes a series of specific steps, typically accompanied by corresponding images. The format of wikiHow articles provides a hierarchical multimodal relationship between images and sentences. We can obtain three types of text-image pairs from wikiHow, in increasing specificity: goal-image, method-image, and step-image. However, these text-image pairs do not give a system enough information to succeed on VGSI; it also needs the appropriate background knowledge. For the example in Figure 2, the system needs to know that "Trick-or-Treating" and "candies" are Halloween traditions and that a "mask" is required during "COVID-19".
In total, as shown in Table 1, the corpus consists of 53,189 wikiHow articles across various categories of everyday tasks, 155,265 methods, and 772,294 steps with corresponding images.

Methods

Problem Formulation
Given a high-level goal $G$, defined as a sequence of words, and an image $I \in \mathbb{R}^{3 \times h \times w}$ with 3 color channels, height $h$, and width $w$, the model outputs the matching score
$$\mathrm{match}(G, I) = F(X_G, X_I),$$
in which $X_G \in \mathbb{R}^{d_G}$ and $X_I \in \mathbb{R}^{d_I}$ are the feature representations of the goal and the image, respectively, and $F$ is the scoring function that models the interactions between the two representations. At inference time, the model chooses the candidate with the highest matching score as its prediction.
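The inference rule above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the goal and image features have already been mapped to the same dimension and that cosine similarity stands in for the scoring function F; the names `cosine_similarity` and `predict` and the toy vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def predict(goal_features, candidate_features, score_fn=cosine_similarity):
    """Score every candidate image against the goal and return the argmax."""
    scores = [score_fn(goal_features, x_i) for x_i in candidate_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy example: one goal vector and four candidate image vectors (assumed values).
goal = [1.0, 0.0, 1.0]
candidates = [
    [0.0, 1.0, 0.0],
    [1.0, 0.1, 0.9],
    [-1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
best_index, scores = predict(goal, candidates)
```

Any of the scoring functions described in the Models section could be passed as `score_fn` in place of the cosine stand-in.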

Models
DeViSE takes in the pre-trained embedding vectors from the two modalities and maps the source vector onto the span of the target vector. DeViSE is trained only on the positive pairs $(G, I)$ and projects an image embedding onto the same dimension as the goal with L2 normalization. It then computes the cosine similarity of the two normalized vectors as the matching score.
Similarity Network. Each branch of the similarity network maps one modality to the joint span and executes point-wise multiplication to construct a joint vector. The last layer is fully-connected with softmax activation and outputs an array of size two to denote the weights of each class for binary classification. We compute the matching score by taking the dot product of $[1, -1]$ with the output.
Triplet Network requires the input to be in the format of a triplet $(G, I_{pos}, I_{neg})$. Three branches in the network map the three embeddings to the same joint span, such that the branches of positive and negative images share the same weight. The network learns the cross-modality by minimizing the positive pair distance and maximizing the negative pair distance. We choose cosine distance as the metric, which is also used as the matching score.
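To make the triplet objective concrete, the following is a minimal sketch assuming a standard hinge-style triplet loss over cosine distance; the exact functional form is an assumption, while the margin of 0.2 is the value reported in Appendix A.1.3. The function names are illustrative, and the inputs are taken to be embeddings already mapped into the joint span.

```python
import math

def cosine_distance(a, b):
    """Cosine distance, i.e. 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def triplet_loss(goal, image_pos, image_neg, margin=0.2):
    """Hinge loss (assumed form): the positive image should be closer
    to the goal than the negative image by at least `margin`."""
    d_pos = cosine_distance(goal, image_pos)
    d_neg = cosine_distance(goal, image_neg)
    return max(0.0, d_pos - d_neg + margin)

# Toy joint-space embeddings (assumed values).
goal = [1.0, 0.0]
satisfied = triplet_loss(goal, [1.0, 0.0], [0.0, 1.0])  # positive is closer
violated = triplet_loss(goal, [0.0, 1.0], [1.0, 0.0])   # negative is closer
```

The loss is zero once the negative image is farther from the goal than the positive image by the margin, so only violating triplets contribute gradient.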
LXMERT is a multimodal encoder that aims to ground text to images through attention layers. The image input is represented as a sequence of objects and the sentence input is a sequence of words. LXMERT utilizes two single-modality transformer encoders (language and object encoders) and a cross-modality transformer encoder to achieve the attention mechanism and capture the relationship between the two modalities. Like the similarity network, LXMERT is trained as a binary classifier.

Table 2: Accuracy of SOTA models on the wikiHow VGSI test set with different sampling strategies (sample size is shown in parentheses).
Figure 3: Accuracy of human (circles) and model (triangles) on the modified wikiHow VGSI test set with different textual input (e.g., in Fig 1, the goal prompt will be replaced by method -"Baking the Fish." or step -"Preheat the oven.").

Multiple-Choice Sampling
We formulate the task as a 4-way multiple-choice question, which makes it easy to evaluate image-text matching performance and is feasible for human annotation. Specifically, a model is given a textual goal and four images and must predict which image is the most reasonable step towards the goal. We utilize three sampling strategies to obtain negative candidates:
Random Strategy. We randomly pick three different articles and select one image at random from each article as the negative sample.
Similarity Strategy. We greedily select the most similar images based on the feature vectors and use FAISS (Johnson et al., 2019) to retrieve the top-3 most similar images from three different articles.
Category Strategy. The three negative samples are randomly selected from articles within the same wikiHow category as the prompt goal.
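The random and category strategies can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the `articles` dictionary layout, the function names, and the toy data are hypothetical, and the category strategy is assumed to draw its three negatives from three different articles. The similarity strategy is omitted because it depends on image feature vectors and a nearest-neighbor index.

```python
import random

def sample_negatives(articles, goal, strategy="random", rng=None):
    """Pick three distractor images for `goal`, one from each of three other articles.

    `articles` maps a goal title to {"category": str, "images": [image ids]}.
    """
    rng = rng or random.Random()
    pool = [title for title in articles if title != goal]
    if strategy == "category":
        # Keep only articles that share the prompt goal's category.
        category = articles[goal]["category"]
        pool = [t for t in pool if articles[t]["category"] == category]
    elif strategy != "random":
        raise ValueError(f"unknown strategy: {strategy}")
    chosen_articles = rng.sample(pool, 3)
    return [rng.choice(articles[t]["images"]) for t in chosen_articles]

def build_question(articles, goal, strategy="random", rng=None):
    """Return (goal, shuffled candidate images, index of the correct image)."""
    rng = rng or random.Random()
    positive = rng.choice(articles[goal]["images"])
    candidates = [positive] + sample_negatives(articles, goal, strategy, rng)
    rng.shuffle(candidates)
    return goal, candidates, candidates.index(positive)

# Toy corpus (made-up titles and image ids).
articles = {
    "Bake Fish": {"category": "Food", "images": ["fish_1", "fish_2"]},
    "Make Tea": {"category": "Food", "images": ["tea_1"]},
    "Boil Eggs": {"category": "Food", "images": ["egg_1", "egg_2"]},
    "Fry Rice": {"category": "Food", "images": ["rice_1"]},
    "Change a Tire": {"category": "Cars", "images": ["tire_1", "tire_2"]},
    "Wash a Car": {"category": "Cars", "images": ["wash_1"]},
}
goal, candidates, answer = build_question(
    articles, "Bake Fish", strategy="category", rng=random.Random(0)
)
```

Passing a seeded `random.Random` keeps the sampled questions reproducible across runs.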
In addition to the multiple-choice format, we also evaluate VGSI in a more realistic goal-image retrieval format (see Appendix B).

Human Annotation
Considering that VGSI is a novel task, we also evaluate how difficult it is for humans. All of our six human annotators are graduate students with good English proficiency. For each annotation test, we selected 100 samples from the testing set. A pair of annotators completed each test and their scores were averaged.

Evaluation Metrics
We report both model and human accuracy for the multiple-choice task. For the retrieval task, we adopt recall at k (recall@k) and median rank (Med r) to measure the performance (see Appendix B).
Table 2 shows the performance of the models and humans on the wikiHow dataset. The Triplet Network with BERT embeddings has the best performance. However, there is still a large gap between human and model performance, indicating that VGSI is challenging even for SOTA models. LXMERT performs poorly under the similarity and category strategies, presumably because it depends heavily on grounding objects, and negative samples generated by these two strategies can share similar objects but refer to different goals. Figure 3 demonstrates that both humans and models perform better with lower-level texts (method or step) as the prompt, which indicates that matching images to high-level goals, as VGSI requires, is the more challenging setting.
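The metrics named in this section can be sketched as below. This is a minimal sketch assuming the convention common in caption-retrieval work: each query contributes the 1-based rank of its highest-ranked correct image, recall@k is the fraction of queries whose rank is at most k, and median rank is the median of those ranks. The function names and toy values are illustrative.

```python
import statistics

def accuracy(predictions, labels):
    """Fraction of multiple-choice questions answered correctly."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def first_relevant_rank(ranked_ids, relevant_ids):
    """1-based rank of the highest-ranked relevant image for one query."""
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant_ids:
            return rank
    raise ValueError("no relevant image in the ranked list")

def recall_at_k(ranks, k):
    """Fraction of queries whose best relevant image appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def median_rank(ranks):
    """Median of the per-query ranks (lower is better)."""
    return statistics.median(ranks)

# Toy values (assumed): four multiple-choice answers and four retrieval ranks.
acc = accuracy([1, 0, 2, 3], [1, 0, 3, 3])
ranks = [1, 3, 12, 50]
r_at_5 = recall_at_k(ranks, 5)
med_r = median_rank(ranks)
```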

Transfer Learning
To robustly show the potential of wikiHow as a multimodal transfer learning resource, we compare it with two existing caption-based datasets, Flickr30K (Plummer et al., 2015) and MSCOCO (Vinyals et al., 2016), which are used as pre-training alternatives. We use the official train/val split for each dataset and pre-train two models separately on Flickr30K and MSCOCO using the same multiple-choice sampling strategies as VGSI.

Target Datasets & Keyframe Extraction
Our transfer targets include COIN and Howto100m, both large-scale datasets of instructional videos. Each video depicts the process of accomplishing a high-level goal, mostly everyday tasks. Since these two datasets are video-based while our task is image-based, we apply a key frame extraction heuristic to get critical frames from videos. We then consider the key frames as steps, thus converting the datasets into the VGSI format.
Howto100m: We randomly select 1,000 goals and one video for each goal. To extract key frames, we apply k-means clustering in the feature space of the frames of each video and select the closest frame to each cluster center. We further filter these frames by manually removing unrelated frames such as the introduction, transition animations, and repetitive frames. We finally obtain 869 goals with an average of 24.7 frames per goal.
COIN: We randomly select 900 videos (5 videos per goal) to construct the test set, and use the remaining 9,709 videos for training. Since COIN has annotations of textual steps and their corresponding video segments, we randomly select one frame within each video segment as a VGSI candidate, resulting in an average of 230.1 frames per goal. Then we use these frames to construct the multiple-choice examples with the same three sampling strategies. We also compare using wikiHow against using COIN and Howto100m as pre-training data to perform transfer learning to each other, since both are instructional video datasets.
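The k-means key frame heuristic described for Howto100m can be sketched with a small standard-library implementation. This is an illustration only: the frame feature vectors are assumed to come from some image encoder, the number of clusters `k` and the iteration count are assumed parameters, and the manual filtering step is not modeled.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns the k cluster centers."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the previous center if a cluster is empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers

def extract_key_frames(frame_features, k):
    """Return sorted indices of the frames closest to each cluster center."""
    centers = kmeans(frame_features, k)
    key_frames = set()
    for center in centers:
        closest = min(
            range(len(frame_features)),
            key=lambda i: squared_distance(frame_features[i], center),
        )
        key_frames.add(closest)
    return sorted(key_frames)

# Toy video with two visually distinct segments (assumed 2-D features).
frames = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]]
key_frame_indices = extract_key_frames(frames, k=2)
```

Selecting the real frame nearest to each center, rather than the center itself, guarantees that every key frame is an actual image from the video.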

Transfer Learning Performance
We use two different transfer learning setups for COIN 5 and Howto100m. For COIN, we formulate the test as a K-shot learning task where K is the number of VGSI training examples for each goal. The 180 goals for testing are seen during training to simulate the scenario where we have some instances of a task. For Howto100m, we split the 869 goals into 8:2 for training and testing, where the test goals are unseen during training.
Tables 3 and 4 both indicate that pre-training on wikiHow can improve VGSI performance on out-of-domain datasets. Especially for the Howto100m results, the model pre-trained on wikiHow without fine-tuning outperforms even those pre-trained on other caption-based datasets that were fine-tuned on wikiHow. This is strong evidence that wikiHow can serve as a useful knowledge resource since the learned multimodal representation can be directly applied to other datasets.
To further validate whether the advantages of pre-training on wikiHow persist as the number of fine-tuning examples increases, we report the performance with K ∈ {0, 5, 10, 15, 20, 25} for COIN and with training examples ranging from 50 to 9,249 (full) for Howto100m. As shown in Figures 4 and 5, the model pre-trained on wikiHow consistently

5 The small number of goals in COIN leads to an extreme imbalance between video frames and texts, which makes training difficult. Thus there is no train/test split on goals.

Conclusion
In this paper, we propose the novel Visual Goal-Step Inference task (VGSI), a multimodal challenge for reasoning over procedural events. We construct a dataset from wikiHow and show that SOTA multimodal models struggle on it. Based on the transfer learning results on Howto100m and COIN, we validate that the knowledge harvested from our dataset can transfer to other domains. The multimodal representation learned from VGSI has strong potential to be useful for NLP applications such as multimodal dialog systems.

A.1.1 DeViSE
DeViSE is trained only on the positive pairs $(I, G)$. Then, the model projects the image embedding onto the same dimension as the goal and we apply L2 normalization to obtain the unit vectors:
$$\hat{X}_I = \mathrm{L2N}(W_{I \to G}^{\top} X_I), \qquad \hat{X}_G = \mathrm{L2N}(X_G),$$
where $\mathrm{L2N}$ stands for L2 normalization and $W_{I \to G} \in \mathbb{R}^{d_I \times d_G}$ is the weight. Then the DeViSE model uses a similarity function (here we choose cosine distance) to compute the distance between $\hat{X}_I$ and $\hat{X}_G$ as the loss:
$$\mathcal{L} = \cos(\hat{X}_I, \hat{X}_G),$$
in which $\cos$ means the cosine distance. For DeViSE, the matching score is the cosine similarity between the two unit vectors.

A.1.2 Similarity Network
A Similarity Network is a type of two-branch network for matching an image and text. It is a supervised model which takes in $(G_i, I_i, y_i)$, where $y_i \in \{0, 1\}$ is the binary label that indicates whether $G_i$ and $I_i$ are related. Each branch of the network maps one modality to the cross-modality span and executes point-wise multiplication to construct a joint vector:
$$X_J = (W_{I \to J}^{\top} X_I) \odot (W_{G \to J}^{\top} X_G),$$
in which $W_{I \to J} \in \mathbb{R}^{d_I \times d_J}$ and $W_{G \to J} \in \mathbb{R}^{d_G \times d_J}$ are the weights and $\odot$ represents an element-wise product.
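The similarity network can be viewed as a binary classifier, and therefore we use binary cross-entropy (BCE) as the loss function:
$$\mathcal{L}_{\mathrm{BCE}} = -\big[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\big],$$
where $p_i$ denotes the predicted probability that $G_i$ and $I_i$ match. The last layer of the similarity network is a fully-connected layer with a softmax activation function, and the output is an array of size two, in which the elements denote the weight for each class. We compute the matching score by multiplying +1 (matched) and −1 (unmatched) on these two elements:
$$\mathrm{match}(G, I) = [1, -1] \cdot \mathrm{softmax}(\mathrm{fc}(X_J)),$$
where fc stands for the fully-connected layer.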
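The matching score of the similarity network can be illustrated with a small sketch. It assumes the two outputs of the final layer are ordered as (matched, unmatched), consistent with the +1 and −1 weights above; the function names and the toy logits are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matching_score(logits):
    """Dot product of [1, -1] with the softmax output.

    `logits` are the two outputs of the final fully-connected layer,
    assumed to be ordered as (matched, unmatched).
    """
    p_matched, p_unmatched = softmax(logits)
    return 1.0 * p_matched + (-1.0) * p_unmatched

# Toy logits (assumed values).
confident_match = matching_score([2.0, -2.0])
undecided = matching_score([0.0, 0.0])
confident_mismatch = matching_score([-2.0, 2.0])
```

Because the two softmax outputs sum to one, the score lies strictly between −1 and 1 and is positive exactly when the matched class receives more weight.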

A.1.3 Triplet Network
A Triplet Network requires the input to be in the format of a triplet $(G, I_{pos}, I_{neg})$. There are three branches in the network, which map the three embeddings to the same joint span:
$$X_G^J = W_{G \to J}^{\top} X_G, \qquad X_{I_{pos}}^J = W_{I \to J}^{\top} X_{I_{pos}}, \qquad X_{I_{neg}}^J = W_{I \to J}^{\top} X_{I_{neg}},$$
in which $W_{G \to J} \in \mathbb{R}^{d_G \times d_J}$ and $W_{I \to J} \in \mathbb{R}^{d_I \times d_J}$ are weights, and the branches of positive and negative images share the same weight. The network learns the cross-modality by minimizing the distance between positive pairs and maximizing the distance between negative pairs. We choose cosine distance as the distance function, which is also used to compute the matching score:
$$\mathcal{L} = \max\big(0,\ \cos(X_G^J, X_{I_{pos}}^J) - \cos(X_G^J, X_{I_{neg}}^J) + m\big),$$
where $\cos$ denotes the cosine distance and $m$ is the margin, which is set to 0.2 in the experiment.

A.1.4 LXMERT
LXMERT learns a position-aware embedding as follows: The text tokens are extracted using a tokenizer (Wu et al., 2016) and converted to index-aware embeddings such that $w_i$ and $i$ are projected onto embedding spaces $\hat{w}_i$, $\hat{u}_i$, to get a common embedding.
Those inputs are then passed through a language encoder $E_G$, an object relationship encoder $E_I$, and a cross-modality transformer encoder $E_J$. Let $X_I = \{v_1, v_2, \ldots, v_n\}$ and $X_G = \{h_1, h_2, \ldots, h_n\}$. Then the cross-modality output $X_J$ is extracted from the output embedding $X_G^J$ that corresponds to the special token [CLS] appended to each input text.
As in A.1.2, we use the BCE loss and compute the matching score from the classifier output.

A.2.1 Vision
We select InceptionV3 (Szegedy et al., 2015) as the feature extractor for the image. We also tried VGG19 and ResNet50, but InceptionV3 had the best performance. We use the second-to-last hidden layer of InceptionV3 to obtain a vector of shape (2048, ).

A.2.2 Language
We use a pre-trained BERT sentence transformer (Reimers and Gurevych, 2019) with bert-base-uncased as our base model. Then, we use max-pooling to get the feature vector with a dimension of (768, ).

A.3 Hyper Parameters
See Table 5.

A.4 Training Details
DeViSE, the Similarity Network, and the Triplet Network were each trained on a single NVIDIA RTX 2080 for 200 epochs with early stopping. The training took less than 10 hours. We used a pre-trained LXMERT model with 9 language layers, 5 cross-encoder layers, 5 vision encoder layers, and a 2-layer linear classification head, with GELU() (Hendrycks and Gimpel, 2016) and ReLU() activation, with a Sigmoid final layer and with normalization in the first layer.
We fine-tune the model for 10 epochs while allowing the gradient to flow through the LXMERT pre-trained layers. We use a binary cross-entropy loss from the PyTorch library and an Adam (Kingma and Ba, 2014) optimizer. Note that we deal with imbalanced datasets by repeating the positive samples and shuffling the data.

B Goal-Image Retrieval Task

B.1 Sampling
Goal-Image Retrieval is a more practical format that gives a high-level goal and a pool of images and aims to rank these images based on their similarity with the goal query.
In this experiment, we randomly select 1,000 high-level goals from the testing set of multiple-choice tasks and choose 5 images for each goal, thus building a pool of 5,000 images.

B.3 In-Domain Performance
As shown in Table 6, the triplet network with BERT as the text embedding has the best performance.

B.4 Query on Different Prompts
As can be seen from Table 7, the model performs better when using the detailed step description as a prompt. Through qualitative analysis (see Figure 8) of some samples, we found that some method descriptions are very general, and that these short, abstract keywords are even more condensed than the goal description. To quantify this finding, we calculate the average number of tokens (after removing stop words) and the vocabulary size of the three types of prompts. The step description is clearly richer than the method and goal descriptions, with a higher token count and vocabulary size. The method description has a lower average number of tokens, which is in line with our observation.

B.5 Transfer Performance on Retrieval
We also evaluate the transfer performance on a retrieval task. For COIN, we choose 5-6 images for each video from the 180 goals and construct a pool of 1,000 images. For Howto100m, we randomly select 5-6 images from each of the videos in the testing set and also form a pool of 1,000 images. Tables 8 and 9 indicate that the model pre-trained on wikiHow outperforms those pre-trained on the other datasets in the retrieval task, and that the aggregation model can further improve the performance.

C Step-Aggregation Model
We have seen that SOTA models do not perform well in VGSI because of the implicit vision-language relation. We therefore develop a step aggregation model that takes advantage of the existing goal-step knowledge from wikiHow. The main idea is as follows: given an unseen textual goal, we use k-nearest neighbors to find the most related article title from wikiHow, then extract the n steps from this article as $S = \{s_1, s_2, \ldots, s_n\}$. Instead of directly using the given goal to match the images (goal score, $\mathrm{Score}_g$), we can use the sequence of steps to improve the matching (step score, $\mathrm{Score}_s$). We then use linear interpolation to combine these two scores into our final matching score.
$$\mathrm{Score}_g = \mathrm{match}(G, I)$$
$$\mathrm{Score}_s = \max_{i=1:n} \mathrm{match}(s_i, I)$$
$$\mathrm{Score}_{final} = \lambda \cdot \mathrm{Score}_g + (1 - \lambda) \cdot \mathrm{Score}_s$$
where $\lambda$ adjusts the weights of the goal and step scores; we choose $\lambda = 0.5$. The main idea of the model is to break down the high-level goal into intermediate steps via a schema. Then we use the induced sequence of steps as the new query to improve the matching performance. For example, in Figure 6, when we want to match the goal "Install License Plate" with two images, the model makes a wrong choice because the negative sample (the right one) also involves the "install" action. However, we can fetch the intermediate steps from wikiHow and use these steps to match the images. The left image (the correct choice) has a higher step-image similarity score than the right one. Therefore, the model can improve its performance with the help of this step information. As we can see from the example steps, they contain some useful entities such as "screw", "bracket", and "bumper", which are closely related to the visual information in the image but do not show up in the goal sentence.
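The score interpolation can be sketched as follows. This is a minimal illustration under loud assumptions: the steps are taken as already retrieved (the k-nearest-neighbor article lookup is not modeled), `toy_match` is a hypothetical word-overlap stand-in for a learned matcher, and the step sentences and image tags are made up around the "Install License Plate" example.

```python
def aggregate_score(goal, steps, image, match, lam=0.5):
    """Interpolate the goal score with the best step score."""
    score_goal = match(goal, image)
    score_step = max(match(step, image) for step in steps)
    return lam * score_goal + (1.0 - lam) * score_step

def toy_match(text, image_tags):
    """Hypothetical matcher: fraction of an image's tags mentioned in the text."""
    words = set(text.lower().split())
    return len(words & image_tags) / len(image_tags)

# Made-up example data; a real system would use learned image and text features.
goal = "Install License Plate"
steps = [
    "Hold the plate against the bracket",
    "Tighten each screw on the bumper",
]
left_image = {"screw", "bumper", "plate"}    # correct choice
right_image = {"install", "shelf", "wall"}   # distractor sharing "install"

goal_only = (toy_match(goal, left_image), toy_match(goal, right_image))
left_final = aggregate_score(goal, steps, left_image, toy_match)
right_final = aggregate_score(goal, steps, right_image, toy_match)
```

In this toy setup the goal alone cannot separate the two images, while the step score breaks the tie in favor of the left image, mirroring the behavior described for Figure 6.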
We apply the aggregation model to both the multiple-choice and retrieval VGSI tasks. As shown in Tables 10 and 11, with the assistance of the aggregation model, multiple-choice accuracy increased by 0.2-2%, and the median rank of retrieval decreased by 5%. Although our approach to using these steps is very simple, it still achieves a marginal improvement. We hope to see more advanced models realize the full potential of wikiHow steps.