Learning Action Conditions from Instructional Manuals for Instruction Understanding

The ability to infer pre- and postconditions of an action is vital for comprehending complex instructions, and is essential for applications such as autonomous instruction-guided agents and assistive AI that supports humans to perform physical tasks. In this work, we propose a task dubbed action condition inference, which extracts mentions of preconditions and postconditions of actions in instructional manuals. We propose a weakly supervised approach utilizing automatically constructed large-scale training instances from online instructions, and curate a densely human-annotated and validated dataset to study how well current NLP models perform on the proposed task. We design two types of models that differ by whether contextualized and global information is leveraged, as well as various combinations of heuristics to construct the weak supervision. Our experiments show a > 20% F1-score improvement from considering the entire instruction contexts and a > 6% F1-score benefit from the proposed heuristics. However, the best-performing model still falls well behind human performance.


Introduction
When accomplishing complex tasks (e.g., making a gourmet dish) composed of multiple action steps, instructional manuals often serve as important and useful guidelines. To follow the instructed actions, it is crucial to ensure the current situation fulfills all the necessary preconditions, i.e., the prerequisites to be met, before taking a particular action. Similarly, it is essential to infer the postconditions, the effects supposed to be caused after performing such an action, to make sure the execution of the action is successful and as expected.
For autonomous agents or assistive AI that aids humans to accomplish certain tasks, understanding these conditions enables the agent to make correct judgements on whether to proceed to the next action, as well as to evaluate the successfulness of a performed action. As illustrated in Figure 1, before performing the action "place onions" in step 3, both "heat the pan" (step 2) and "slice onions" (step 1) have to be successfully accomplished, and hence should be regarded as preconditions of step 3. On the other hand, after executing "stir onions" (step 4), its desired outcome, "caramelized", should be recognized as the postcondition in order to assess the completion of the execution. These actions and their pre/postcondition dependencies are prevalent in instructional texts and can be inferred by comprehending the instruction texts. To this end, we propose the action condition inference task on instructional manuals, where a dependency graph is induced, as in Figure 1, to denote the pre- and postconditions of actions.
We consider two popular online instructional manuals, WikiHow (Hadley et al.) and Instructables.com, to study the current NLP models' capabilities of performing the proposed action condition inference task. As there is no densely annotated dataset for the complex pre- and postcondition dependency structures of actions, we collect comprehensive human annotations on a subset of 650 samples. This allows us to benchmark models in either a zero-shot setting, where no annotated data is used for training, or a low-resource setting with a limited amount of annotated training data.

[Figure 2: An example instruction on replacing a trimmer line ("Step 1: Prepare the line. ... Step 2: Make sure your trimmer's engine is turned off. ... Step 3: Remove the retaining cap from the trimmer head. ..."). Notice that an actionable can have multiple pre- or postconditions and that they can span across different instruction steps. For simplicity we do not show an exhaustive set of text segments of interest, i.e., in the actual dataset there might be more. (Right) One sample SRL extraction which corresponds to one of the action-condition dependency linkages on the left.]
We also design heuristics to automatically construct weakly supervised training data. Specifically, we consider the following heuristics: (1) Key entity tracing: we hypothesize that if the same entity (including resolved co-references) is mentioned in two instruction descriptions, there is likely a dependency between them. (2) Keywords: certain keywords, such as the word before in the description "do X before doing Y", can often imply the condition dependencies. (3) Temporal reasoning: while conditional events are naturally temporally grounded (e.g., preconditions should occur prior to an action), the narrated order of events may not be consistent with their actual temporal order. We thus adopt a temporal relation resolution module (Han et al., 2021b) to alleviate such an issue.
To benchmark the proposed task, we consider two types of models: one takes only a pair of input descriptions and predicts their relation without other contexts, while the other takes the entire instruction paragraphs into account to leverage contextualized global information. Weak supervision has been shown to benefit learning with limited labeled data in many NLP tasks (Plank and Agić, 2018; Hedderich et al., 2020), so we also propose different ways to combine annotated and unlabelled data to further improve model performance.
We evaluate the models on a held-out test set of the annotated data, where we observe that the contextualized models outperform the non-contextualized counterparts by a large margin (> 20% F1-score), and that our proposed heuristics further improve the contextualized models significantly (> 6% F1-score) in the low-resource setting. In addition, we conduct ablation studies on the designed heuristics to assess their respective contributions and provide a more in-depth analysis of the nature of both our task and the utilized instructions.
Our key contributions are three-fold: (1) We propose the action-condition inference task and create a densely human-annotated dataset to spur research on structural instruction comprehension. (2) We design heuristics utilizing entity tracing, keywords, and temporal common sense to construct effective large-scale weak supervision. (3) We benchmark model performance on the proposed task to shed light on future research in this direction.

Terminologies and Problem Definition
Our goal is to learn to infer the knowledge of action-condition dependencies in real-world task-oriented instructional manuals. We first describe the terminologies used throughout the paper: Actionable refers to a phrase that a person can follow and execute in the real world (yellow-colored phrases in Figure 2). We also consider negated actions (e.g., do not ...) or actions warned to avoid (e.g., if you purchase the wrong ...), as they likely also carry useful knowledge regarding the tasks. Precondition concerns the prerequisites to be met for an actionable to be executable, which can be a status, a condition, and/or another prior actionable (blue-colored phrases in Figure 2). It is worth noting that humans can omit explicitly writing out certain condition statements because of their triviality, as long as the actions inducing them are mentioned (e.g., heat the pan → pan is heated, where the latter can often be omitted). We thus generalize the conventional formulation of precondition used in planning languages such as STRIPS (Fikes and Nilsson, 1971), i.e., sets of statements evaluated to true/false, to a phrase that is either a passive condition statement or an actionable that induces the prerequisite conditions, as inspired by (Linden, 1994).
Postcondition is defined as the outcome caused by the execution of an actionable, which often involves status changes of certain objects (or the actor itself) or certain effects emerging to the surroundings or world state (green-colored phrases in Figure 2).
Text segment is the term we will use to refer to a textual segment of interest, which can be one of the {actionable, precondition, postcondition} statements, throughout the rest of the paper.
In reality, a valid actionable phrase should have both precondition and postcondition dependencies, as a real-world executable action will always have certain prerequisites to meet and outcomes it causes. However, we do not enforce this in this work, as conditions can occasionally be omitted by the authors of human-written instructions.
Problem Formulation. Given an input instructional manual and some text segments of interest extracted from it, a model is asked to predict the directed relation between a pair of segments, where the relation should be one of the following: NULL (no relation), precondition, or postcondition.

Datasets and Human Annotations
We are interested in understanding current NLP models' capability of inferring the action-condition dependencies in instructional manuals. To this end, we consider two popular online instruction resources, WikiHow and Instructables.com, both consisting of articles composed of multiple steps with detailed step descriptions, to support our investigation. For WikiHow, we use the dataset provided by (Wu et al., 2022); for Instructables, we scrape the contents directly from their website.
As densely annotating large-scale instruction sources for the desired dependencies can be extremely expensive and laborious, we propose to train the models via a weakly supervised method utilizing a few designed heuristics to construct large-scale training data automatically, and to then finetune the models with limited human-annotated instructions to further improve the performance. For this purpose, as well as for performing a more grounded evaluation, we collect comprehensive human annotations primarily on a selected subset of each dataset to serve as our annotated-set, and in particular refer to the subsets used to evaluate the models as the annotated-test-set. In total, our densely annotated-set has 500 samples in WikiHow and 1503 samples in Instructables. In Section 6.2, we will describe how the annotated-set is split to facilitate low-resource training. We also collect human performance on the annotated-test-set to gauge the human upper bound of our proposed task. More dataset details are in Append. Sec. A.

Annotations and Task Specifications
Dataset Structure.The basic structure of the data we desire to construct features two main components: (1) text segments, which encompass the main action/condition descriptions as indicated in Section 2, and (2) linkage, a directed relational link connecting a pair of text segments.
Annotation Process. We conduct the annotated-set construction via Amazon Mechanical Turk (MTurk). Each MTurk worker is prompted with a multi-step instructional manual and its intended goal, where the annotation process consists of three main steps: (1) Text segment highlighting: to facilitate this step (as well as postulating the text segments of interest for automatically constructing weak-supervision data in Section 4), we pre-highlight several text segments extracted by semantic role labelling (SRL) for workers to choose from; however, they can also freely annotate (highlight by cursor) their more desirable segments. (2) Linking: we encourage the workers to annotate all the possible segments of interest, and then they are asked to connect certain pairs of segments that are likely to have dependencies with a directed edge.
(3) Labelling: finally, each directed edge drawn needs to be labelled as either a pre- or postcondition (NULL relations do not need to be explicitly annotated). More details are in Append. Sec. B.
Since the agreements among workers on both text segments and condition linkages are sufficiently high, our final human annotated-set retains the majority-voted segments and linkages.

[Table 1: Sample Linking Heuristics. For each of the applied heuristics we show one or two exemplar use cases and their detailed descriptions (e.g., the action prying should occur prior to stepping, but these two segments are narrated in reverse order in the contexts). The color scheme is the same as in Figure 2.]
Variants of Tasks. Although proper machine extraction of the text segments of interest (especially for actionables) as a span-based prediction can be a valid and interesting task, in this paper we mainly focus on linkage prediction (including the labels), assuming that these text segments are given, and leave the overall system, i.e., end-to-end text segment extraction and linkage prediction, as future work. Our proposed task and the associated annotated-set can be approached in a zero-shot or low-resource setting: the former involves no training on any of the annotated data, and a heuristically constructed training set can be utilized (Section 4), while the latter allows models to be finetuned on a limited annotated subset (Section 5.3).

Training With Weak Supervision
As mentioned in Section 3, our proposed task can be approached via a zero-shot setting, where the vast amount of un-annotated instruction data can be transformed into a useful training resource (with the same dataset structure as described in Section 3.1). Moreover, it has been shown in many low-resource NLP tasks that constructing a much larger heuristic-based weakly supervised dataset can be rather beneficial (Plank and Agić, 2018; Nidhi et al., 2018).

Linking Heuristics
The goal of incorporating certain heuristics is to perform rule-based determination of the linkages (i.e., the action-condition dependencies) between text segments within an article. We mainly consider heuristics that are widely applicable to all kinds of instructional data, as long as they share a similar (step-by-step) writing style. There are four types of heuristics incorporated: (1) Keywords: certain keywords, such as if, before, and after, are hypothesized to show strong implications of conditions; (2) Key entity tracing: text segments that share the same key entities likely indicate dependencies; (3) Co-reference resolution is adopted to supplement (2); (4) Event temporal relations: we incorporate a temporal resolution technique to handle scenarios where the narrative order does not align with the actual temporal order of the events.
Without access to human refinements (Section 3.1), we leverage SRL to postulate all the segments of interest to construct the weakly-supervised set.

Keywords
In Table 2 we list the major keywords considered in this work. As illustrated in the second row of Table 1, keywords are used as separators so that the text segments around them can be properly linked. Different keywords and their positions within sentences (or paragraphs) can lead to different directions of the linkages; e.g., before and after are two keywords that intuitively lead to different directions if they are placed at non-beginning positions. We follow the rules listed in Table 2 to decide the directions.
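As a concrete illustration, the splitting and direction rules can be sketched roughly as below. The keyword list and the direction assignments here are simplified assumptions for illustration, not the exact rules of Table 2:

```python
# Hedged sketch of the keyword linking heuristic (cf. Table 2).
# For each keyword, we record whether the clause it governs describes the
# EARLIER event: "after you heat the pan, ..." -> governed clause is earlier;
# "before you start the engine, ..." -> governed clause is later.
KEYWORDS = {"after": True, "once": True, "before": False, "until": False}

def keyword_link(sentence):
    """Return (earlier_segment, later_segment) -- a candidate
    precondition-directed pair -- or None if no keyword applies."""
    lowered = sentence.lower().rstrip(".")
    for kw, governed_is_earlier in KEYWORDS.items():
        if lowered.startswith(kw + " "):
            # Keyword at the beginning: the first comma separates the clauses.
            governed, _, other = sentence[len(kw):].partition(",")
        elif f" {kw} " in lowered:
            # Keyword within the sentence: the keyword itself is the separator.
            idx = lowered.index(f" {kw} ")
            other = sentence[:idx]
            governed = sentence[idx + len(kw) + 2:]
        else:
            continue
        governed, other = governed.strip(" ,."), other.strip(" ,.")
        if not governed or not other:
            continue
        return (governed, other) if governed_is_earlier else (other, governed)
    return None
```

In the full pipeline, the resulting clauses would additionally be refined with SRL, as the table caption describes.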

Key Entity Tracing
It is intuitive to assume that if two text segments mention the same entity, a dependency between them likely exists, and hence a trace of the same mentioned entity can postulate potential linkages. As exemplified in the first row of Table 1, that heating the pan is a necessary precondition of placing onions in the pan can be inferred from the shared mention "pan". We adopt two ways to propose the candidate entities: (1) we extract all the noun phrases within the SRL segments (mostly ARG-tags); (2) inspired by (Bosselut et al., 2018), a model is learned to predict potential entities that are involved but not explicitly mentioned in the context (e.g., fry the chicken may imply a pan is involved); for more details see Append. Sec. C.1.3.
Co-References. Humans often use pronouns referring to the same entity to alternate the mentions in articles, as exemplified by the mentions onions and them in the first row of Table 1. Therefore, a straightforward augmentation to the aforementioned entity tracing is incorporating co-references of certain entities. We utilize a co-reference resolution model (Lee et al., 2018) to propose possible co-referred terms of the extracted entities of each segment within the same step description (we do not consider cross-step co-references for simplicity).
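A rough sketch of entity tracing with co-reference substitution follows. The paper extracts noun phrases from SRL arguments and resolves co-references with a neural model; the naive tokenization and hand-built co-reference map below are illustrative stand-ins:

```python
# Hedged sketch of key entity tracing: two segments are linked when they share
# at least one (co-reference-resolved) entity mention.

def entities(segment, coref_map=None):
    """Very rough entity proposal: lowercase content words, with pronouns
    replaced by their co-referred entity when a mapping is available."""
    coref_map = coref_map or {}
    stop = {"the", "a", "an", "in", "on", "to", "and"}
    toks = [t.strip(",.").lower() for t in segment.split()]
    return {coref_map.get(t, t) for t in toks if t not in stop}

def traced_link(earlier_seg, later_seg, coref_map=None):
    """Propose a directed linkage (earlier -> later) if the two segments
    share an entity mention; return None otherwise."""
    shared = entities(earlier_seg, coref_map) & entities(later_seg, coref_map)
    return (earlier_seg, later_seg) if shared else None
```

Note the direction always points from the earlier-narrated segment to the later one; the temporal-relation heuristic described below may later invert it.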

Linking Algorithm
After applying the aforementioned linking heuristics, each text segment, denoted as a_i, can have M linked segments: {a_{l_i^1}, ..., a_{l_i^M}}. For linkages traced by entity mentions (and co-references), their directions always point from the earlier-narrated segment to the later one, while linkages determined by keywords follow Table 2 for deciding their directions. However, text segments narrated too far away from a_i are less likely to have direct dependencies. We therefore truncate the linked segments by ensuring any a_{l_i^j} is narrated no more than S steps ahead of a_i, where S is empirically chosen to be 2 in this work.
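The truncation step can be sketched as follows, assuming each candidate segment is tagged with the index of the instruction step it is narrated in (this data format is an assumption for illustration):

```python
# Sketch of the step-distance truncation in the linking algorithm:
# links reaching back more than S steps are dropped (S = 2 in the paper).

def truncate_links(anchor_step, linked, S=2):
    """Keep only linked segments narrated at most S steps before the anchor.

    `linked` is a list of (segment_text, step_index) pairs proposed by the
    keyword / entity-tracing heuristics."""
    return [(seg, step) for seg, step in linked
            if 0 <= anchor_step - step <= S]
```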

Incorporating Temporal Relations
As hinted in Section 2, the conditions with respect to an actionable imply temporal relations between them. As previously mentioned, the direction of an entity-trace-induced linkage is determined by the narrated order of the text segments within the contexts; however, in circumstances such as the fourth row of Table 1, the narrative order can be inconsistent with the actual temporal order of the associated events. To alleviate such inconsistency, we apply an event temporal relation prediction model (Han et al., 2021b) to fix the linkage directions (this does not affect linkages decided by the keywords). The utilized model predicts the temporal relation of each pair of event triggers (extracted by SRL, i.e., verbs/predicates), and we then invert the direction of an entity-trace-induced linkage, a_{l_i^j} → a_i, if their predicted temporal relation is opposite to their narrated order.

[Table 2: Keywords used for deciding a linkage, grouped by position (beginning of sentence vs. within sentence); e.g., before, until, in order to, so. If a keyword is at the beginning of a sentence, we use the (first) comma of that sentence to separate it into two segments and link them accordingly; otherwise the keyword itself is used as the separator. The segments are then either refined with SRL or kept as they are if SRL does not detect a valid verb.]

Labelling The Linkages
It is rather straightforward to label precondition linkages, as a simple heuristic can be used: for a given segment, any segments linked to the current one that are either narrated or temporally prior to it are plausible candidates for being preconditions. Postconditions, on the other hand, are mostly descriptions of status (changes); we therefore make use of certain linguistic cues that likely indicate human-written statuses, e.g., the water will be frozen and the oil is sizzling. Specifically, we consider: (1) be-verbs followed by present-progressive tenses if the subject is an entity, and (2) segments whose SRL tags start with ARGM, as exemplified in Table 1.
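The be-verb status cue could be approximated with a simple pattern like the following sketch. The paper additionally checks that the subject is an entity and inspects SRL ARGM tags, which are omitted here:

```python
# Hedged sketch of the postcondition labelling cue: a be-verb followed by a
# progressive (-ing) or participle form often marks a written status,
# e.g. "the oil is sizzling", "the water will be frozen".
import re

STATUS_CUE = re.compile(
    r"\b(is|are|was|were|be|been|being)\s+\w+(ing|ed|en)\b", re.IGNORECASE)

def looks_like_postcondition(segment):
    """True if the segment matches the be-verb + progressive/participle cue."""
    return bool(STATUS_CUE.search(segment))
```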

Models
To benchmark the proposed task, we mainly consider two types of models: (1) a non-contextualized pairwise prediction model takes only the two text segments of interest at a time and makes the trinary (directed) relation prediction, i.e., NULL, precondition, or postcondition; (2) a contextualized model also makes the relation prediction for every pair of input segments, but takes as input the whole instruction paragraphs so that the contexts of the segments are preserved. Both models are based on pretrained language models, and the relation prediction modules are multi-layer perceptrons (MLPs) added on top of the language models' outputs. Cross-entropy loss is used for training.

Non-Contextualized Pairwise Model
For the non-contextualized model, we feed the two text segments of interest, a_i and a_j, to the language model similarly to the next sentence prediction objective in BERT (Devlin et al., 2019) (i.e., the order of the segments matters, which is considered in determining their relations), as illustrated in Figure 3a. As in BERT, the [CLS] representation is fed to an MLP to predict the relation.
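The pair input construction can be sketched as below; the [CLS]/[SEP] token strings are illustrative of the BERT-style format (RoBERTa, the actual base model, uses its own special tokens):

```python
# Sketch of the non-contextualized pairwise input: the two ordered segments
# are packed into one sequence, and the [CLS] representation is classified
# into one of the three relation labels by an MLP (not shown here).

LABELS = ["NULL", "precondition", "postcondition"]

def pairwise_input(segment_a, segment_b):
    """Build the ordered pair input; segment order matters because the
    predicted relation is directed."""
    return f"[CLS] {segment_a} [SEP] {segment_b} [SEP]"
```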

Contextualized Model
The architecture of the contextualized model is depicted in Figure 3b. Denote the tokens of the instruction text as {t_i} and the tokens of the i-th text segment of interest (either automatically extracted by SRL or annotated by humans) as {a_ij}. A special start- and end-of-segment token pair, <a> and </a>, is wrapped around each text segment, and hence the input tokens become: "t_1, ..., t_k, <a> a_i1, a_i2, ..., a_iK </a>, ...". The contextualized segment representation o(a_i) is then obtained by applying mean pooling over the language model output representations of the segment's tokens. To determine the relation between segments i and j, we feed their ordered concatenated representation, concat(o(a_i), o(a_j)), to an MLP for the relation prediction.
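A minimal sketch of the segment marking and mean pooling, with plain Python lists standing in for the language model's output vectors:

```python
# Sketch of the contextualized model's input marking and segment pooling.
# In the real model, the wrapped token sequence is fed to RoBERTa and
# concat(o(a_i), o(a_j)) goes to an MLP; only the two data steps are shown.

def wrap_segments(tokens, spans):
    """Insert <a> / </a> around each (start, end) token span (end exclusive).
    Spans are assumed non-overlapping and sorted."""
    out, prev = [], 0
    for start, end in spans:
        out += tokens[prev:start] + ["<a>"] + tokens[start:end] + ["</a>"]
        prev = end
    return out + tokens[prev:]

def segment_repr(token_vectors, start, end):
    """Mean-pool the output vectors of a segment's tokens to get o(a_i)."""
    dim = len(token_vectors[0])
    span = token_vectors[start:end]
    return [sum(v[d] for v in span) / len(span) for d in range(dim)]
```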

Learning
Multi-Staged Training. For different variants of our proposed task (Section 3.1), we can utilize different combinations of the heuristically constructed dataset and the annotated-train-set. For the low-resource setting, our models can thus undergo multi-staged training, where they are first trained on the constructed training set and then finetuned on the annotated-set. Furthermore, following the self-training paradigm (Xie et al., 2020; Du et al., 2021), the previously obtained models can be utilized to construct pseudo supervision by augmenting their predictions into (and sometimes correcting) the heuristically constructed data, so as to learn a more robust prior before finetuning on the annotated-set.
Label Balancing. Most of the relations between randomly sampled pairs of text segments are NULL, and hence the training labels are imbalanced. To overcome this issue, we downsample the negative samples when training the models. Specifically, we fill each training mini-batch with equal amounts of positive (relation is not NULL) and negative pairs, where the negatives are constructed by either inverting the positive pairs or replacing one of the segments with another randomly sampled segment from the same article that has no relation to the remaining one.
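The batch construction could look roughly like this sketch (the 50/50 choice between the two negative-construction strategies and other details are illustrative assumptions):

```python
# Hedged sketch of the label-balancing scheme: each mini-batch mixes equal
# numbers of positive pairs and NULL-labelled negatives, where a negative
# either inverts a positive pair or swaps in an unrelated segment from the
# same article.
import random

def balanced_batch(positives, unrelated_segments, rng=None):
    """positives: list of (seg_a, seg_b, label) with label != "NULL";
    returns a batch with an equal number of NULL-labelled negatives."""
    rng = rng or random.Random(0)
    batch = list(positives)
    for seg_a, seg_b, _ in positives:
        if rng.random() < 0.5:
            batch.append((seg_b, seg_a, "NULL"))  # inverted positive pair
        else:
            # unrelated replacement from the same article
            batch.append((seg_a, rng.choice(unrelated_segments), "NULL"))
    return batch
```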

Experiments and Analysis
Our experiments seek to answer these questions: (1) How well can models and humans perform on the proposed task? (2) Is instructional context information important for action condition inference? (3) Are the proposed heuristics and the second-stage self-training effective?

Training and Implementation Details
For both the non-contextualized and contextualized models, we adopt the pretrained RoBERTa (-large) language model (Liu et al., 2019) as the base model. All the linguistic features, i.e., SRL (Shi and Lin, 2019), co-references, and POS-tags, are extracted using models implemented in AllenNLP (Gardner et al., 2017). We truncate the input texts at a maximum length of 500 while ensuring all text segments within this length are preserved completely.

Experimental Setups
Data Splits. The primary benchmark, the WikiHow annotated-set, is partitioned into train (30%), development (10%), and test (60%) splits.
Evaluation Metrics. We ask the models to predict the relations on every pair of text segments in a given instruction, and compute the average precision (Prec.), recall, and F1 scores with respect to the precondition and postcondition labels respectively, across the entire test-set.
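The per-label scoring can be sketched as follows, treating NULL as the negative class (a stand-in for a standard precision/recall/F1 computation, e.g. via scikit-learn):

```python
# Sketch of the evaluation: predictions are made for every ordered segment
# pair, and precision / recall / F1 are computed separately for the
# precondition and postcondition labels.

def prf1(gold, pred, label):
    """gold, pred: dicts mapping ordered (i, j) pairs to relation labels."""
    tp = sum(1 for k, g in gold.items() if g == label and pred.get(k) == label)
    fp = sum(1 for k, p in pred.items() if p == label and gold.get(k) != label)
    fn = sum(1 for k, g in gold.items() if g == label and pred.get(k) != label)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```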

Main Results
The upper half of Table 3 summarizes both the human and model performance on our standard split (30% train, 60% test) of the WikiHow annotated-set. The contextualized model clearly outperforms the non-contextualized counterpart by a large margin. Significant improvements on both pre- and postcondition inference can be observed when the heuristically constructed data is utilized, especially when no second-stage self-training is involved. The best performance is achieved by applying all the heuristics we design, and can be further improved by augmenting the constructed weakly-supervised dataset with pseudo supervision. Similar performance trends can be observed in the lower half of Table 3, where a zero-shot transfer from models trained on WikiHow data to Instructables is conducted. On both datasets, there are still large gaps between the best model and human performance (> 20% F1-score).
Heuristics Ablations. Table 3 also features ablation studies on the designed heuristics. One can observe that keywords are mostly effective for inferring the postconditions, and co-references are significantly beneficial on the Instructables data, which can hypothetically be attributed to the writing styles of the two datasets (i.e., authors on Instructables may use co-referred terms much more). Temporal relation resolution is consistently helpful across pre- and postconditions as well as datasets, suggesting that relying only on narrated orders could degrade the performance.
Error Analysis. Our (best) model performs well on linkages that share similarities with the designed heuristics, which is expected, but can sometimes overfit to certain heuristic concepts, e.g., erroneously predicting "use a sharp blade to cut ..." as a precondition to "look for a blade" (entity tracing) in a food preparation context. Another representative class of errors can be attributed to causal understanding, which is currently not handled by our heuristics and can be an interesting future direction, e.g., failing to predict that "decrease the pedal resistance" has the precondition "body start leaning to the sides" (this example is extracted from segments not link-able even via the keyword heuristic) in a biking context.

The Effect of Training Set Size
Table 3 shows that with a small amount of data for training, our models can perform significantly better than in the zero-shot setting. This raises a question: how would the performance change with respect to the training set size, i.e., do we just need more data? To quantify the effect of training size on model performance, we conduct an experiment where we vary the sample size of the training set while fixing the development (10%) and test (30%) sets for consistency. We use the best settings in Table 3, i.e., all the heuristics applied and the two-staged self-training adopted, for this study. The results are reported in Table 4. We observe a plateau in performance as the training set size approaches 60%, implying that simply adding more training samples does not necessarily yield significant improvements.

Related Works
Procedural Text Understanding. Uncovering knowledge in texts that specifically feature procedural structure has drawn much attention, including aspects of tracking entity state changes (Branavan et al., 2012; Bosselut et al., 2018; Mishra et al., 2018; Tandon et al., 2020), incorporating common sense or constraints (Tandon et al., 2018; Du et al., 2019), procedure-centric question answering (QA) (Tandon et al., 2019), and structural parsing or generation (Malmaud et al., 2014; Zellers et al., 2021). Clark et al. (2018) leverage VerbNet (Schuler, 2005) with if-then constructed rules, where if is one of the keywords we also utilize, to determine object-state postconditions for answering state-related reading comprehension questions. In addition, some prior works specifically formulate precondition understanding as multiple-choice QA for event triggers (verbs) (Kwon et al., 2020) and common sense phrases (Qasemi et al., 2021). We hope our work on inferring action-condition dependencies, an essential kind of knowledge especially for understanding task procedures, from long instruction texts, can help advance the goal of more comprehensive procedural text understanding.
Drawing dependencies among procedure steps has been explored in (Dalvi et al., 2019; Sakaguchi et al., 2021); however, their procedures come from manually synthesized short paragraphs. Our work, on the other hand, aims at inferring diverse dependency knowledge directly from more complex, real-world, task-solving-oriented instructional manuals, enabling the condition dependencies to go beyond inter-step and narrative-order boundaries.

Conclusions
In this work we propose a task of inferring action and (pre/post)condition dependencies in real-world online instructional manuals. We formulate the problem in both zero-shot and low-resource settings, where several heuristics are designed to construct effective large-scale weakly supervised data. While the proposed heuristics and the two-staged training lead to significant performance improvements, the results still highlight significant gaps below human performance (> 20% F1-score).
We provide insights and the collected resources to spur relevant research, and suggest the following future directions: (1) As our data also features span annotations of the text segments, end-to-end proposal of actionables, conditions, and their relations can be a next step. (2) Knowledge of the world states implied by the text descriptions, as well as external knowledge of the entities, can be incorporated into our heuristics. (3) Equipping models with causal common sense could be beneficial.

Our work currently has the following limitations:
(1) We do not deal with end-to-end actionable and condition-dependency inference. While this work focuses on predicting the relation linkages, we look forward to building a more comprehensive system in the future that can also predict proper actionable (and condition) text segments, which can be evaluated against our human annotations as well.
(2) The current system is only trained on unimodal (text-only) and English instruction resources. Multilingual and multimodal versions of our work could be interesting future endeavors. (3) In this work, we mostly consider instructions for physical tasks. While certain conditions and actions can still be defined on more social domains of data (e.g., a precondition to being a good person might be cultivating good habits), we cannot guarantee the performance of our models when applied to data from these less physically-oriented domains.

Ethics and Broader Impacts
We hereby acknowledge that all co-authors of this work are aware of the provided ACM Code of Ethics and honor the code of conduct. This work is mainly about inferring pre- and postconditions of a given action item in an instructional manual. The following paragraphs describe both our ethical considerations and our potential impacts to the community.
Dataset. We collect the human annotations of the ground-truth condition-action dependencies via Amazon Mechanical Turk (MTurk) and ensure that all personal information of the workers involved (e.g., usernames, emails, URLs, demographic information, etc.) is discarded in our dataset. Although we aim at providing a test set that is agreed upon by various people examining the instructions, there might still be unintended biases within the judgements; we make efforts to reduce these biases by collecting a diverse set of instructions in order to arrive at a better general consensus on our task.
This research has been reviewed by the IRB board and granted IRB-exempt status. The detailed annotation process (pay per amount of work, guidelines) is included in the appendix; overall, we ensure our pay per task is above the annotators' local minimum wage (approximately $15 USD / hour). We primarily consider English-speaking regions for our annotations, as the task requires a certain level of English proficiency.

Techniques.
We benchmark the proposed condition-inference task with state-of-the-art large-scale pretrained language models and our proposed training paradigms. As common sense and task-procedure understanding are our main focus, we do not anticipate the production of harmful outputs, especially towards vulnerable populations, after training (and evaluating) models on our proposed task.

A Details of The Datasets
Resource-wise, our work utilizes online instructional manuals (e.g., WikiHow) following many existing works (Zhou et al., 2019; Zhang et al., 2020; Wu et al., 2022). Specifically, the large-scale WikiHow training data is provided by (Wu et al., 2022), while we scrape the Instructables.com data on our own.
We report the essential statistics of the annotated sets in Table 5. Each unique WikiHow URL can have several multi-step sections, and we denote each unique section as a unique article in our dataset; for Instructables.com, each URL maps to only a single section. As a result, for WikiHow we first manually select a set of URLs judged to feature high-quality instructions (i.e., articles consisting of clearly instructed actions without much non-meaningful or unhelpful monologue from the writer), and then sample one or two sections from each URL to construct our annotated set. The statistics of the datasets used to construct the large-scale weakly supervised WikiHow training set can be found in Section 3 of (Wu et al., 2022); we use their provided WikiHow training samples, which are mostly from physical categories.
* Our densely annotated datasets and relevant tools will be made public upon paper acceptance.

A.1 Dataset Splits
All annotated Instructables.com data samples are used as an evaluation set, so we do not need to explicitly split them. For WikiHow, we split mainly with respect to URLs to ensure that no articles (i.e., sections) from the same URL are put into different data splits, so as to prevent the model from exploiting the writing style and knowledge of articles sharing a URL on WikiHow. The URL-level split is also random.
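The URL-level split can be sketched as follows (a minimal illustration; the function name, split ratios, and seed are our assumptions, not from the released code):

```python
import random

def split_by_url(articles, ratios=(0.5, 0.2, 0.3), seed=0):
    """articles: list of (url, section) pairs. Randomly split at the URL
    level so that sections from the same URL never land in different splits."""
    urls = sorted({url for url, _ in articles})
    random.Random(seed).shuffle(urls)
    cut1 = int(ratios[0] * len(urls))
    cut2 = cut1 + int(ratios[1] * len(urls))
    # Every URL gets exactly one split; its sections follow it.
    assign = {url: ("train" if i < cut1 else "dev" if i < cut2 else "test")
              for i, url in enumerate(urls)}
    splits = {"train": [], "dev": [], "test": []}
    for url, section in articles:
        splits[assign[url]].append((url, section))
    return splits
```

Because assignment happens per URL rather than per article, two sections of the same URL can never cross splits.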

B Details of Human Annotations

B.1 Inter-Annotator Agreements (IAAs)
There are two types of inter-annotator agreements (IAAs) we compute: (1) IAA on text segments and (2) IAA on linkages, and we describe the details of their computations in this section.
IAA on Text Segments. For each worker-highlighted text segment, either coming from directly clicking the pre-highlighted segments or from the workers' own selections, we compute the percentage of overlapping tokens between segments annotated by different workers. If this percentage is > 60% for each segment in comparison, we regard the two segments as aligned. Concretely, for all unique segments of the same article annotated by different workers, we can postulate a segment dictionary in which aligned segments from different worker annotations are merged into the same entries. Each worker's annotation can then be viewed as a binary indicator of existence for each item in this segment dictionary, and we compute the Cohen's Kappa inter-annotator agreement score for every pair of annotators to derive the averaged IAA scores.
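This alignment-then-kappa computation can be sketched as below (a simplified illustration; whitespace tokenization and using the first segment of each dictionary entry as its representative are our assumptions):

```python
from itertools import combinations

def token_overlap(seg_a, seg_b):
    """Fraction of overlapping tokens, relative to each segment."""
    a, b = set(seg_a.lower().split()), set(seg_b.lower().split())
    inter = len(a & b)
    return inter / len(a), inter / len(b)

def aligned(seg_a, seg_b, thresh=0.6):
    """Two segments align if the overlap exceeds the threshold for both."""
    ra, rb = token_overlap(seg_a, seg_b)
    return ra > thresh and rb > thresh

def build_dictionary(annotations):
    """Merge aligned segments across workers into one shared dictionary."""
    dictionary = []
    for segs in annotations:
        for seg in segs:
            for entry in dictionary:
                if aligned(seg, entry[0]):
                    entry.append(seg)
                    break
            else:
                dictionary.append([seg])
    return dictionary

def cohens_kappa(x, y):
    """Cohen's kappa for two binary vectors."""
    n = len(x)
    po = sum(a == b for a, b in zip(x, y)) / n
    p1x, p1y = sum(x) / n, sum(y) / n
    pe = p1x * p1y + (1 - p1x) * (1 - p1y)
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

def segment_iaa(annotations):
    """Average pairwise kappa over binary segment-existence vectors."""
    dictionary = build_dictionary(annotations)
    vectors = [[int(any(aligned(s, e[0]) for s in segs)) for e in dictionary]
               for segs in annotations]
    kappas = [cohens_kappa(x, y) for x, y in combinations(vectors, 2)]
    return sum(kappas) / len(kappas)
```

The linkage IAA described next reuses the same machinery, with (head, tail, label) triples in place of segments.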
IAA on Linkages. Similar to the construction of the segment dictionary, we also construct a linkage dictionary where every link has a head segment pointing to a tail segment, with both segments coming from items in the segment dictionary. We can thus also treat the linkage annotations across different workers as binary existence indicators and perform a similar inter-annotator agreement computation. The resulting IAAs for each dataset and annotation type are reported in Section 3.1.

B.2 Annotation Process
We adopt Amazon Mechanical Turk (MTurk) to publish and collect our annotations, where each annotation unit in MTurk is called a Human Intelligence Task (HIT). As shown in Figure 4a, at the top of each HIT we have a detailed description of the task's introduction, terminologies, and instructions. For the terms we define, such as actionables and pre-/postconditions, we also illustrate them with detailed examples. To make it easier for workers to quickly understand our task, we provide a video version explaining the important concepts and the basic operations. We also set up a Frequently Asked Questions (FAQ) section and constantly update it with questions gathered from the workers.
Figure 4b shows the layout of the annotation panel. A few statements are pre-highlighted in grey, and each of them is clickable. These statements are automatically pre-selected using the SRL heuristics described in Section 3.1, which are intended to cover as many potential actionables and pre-/postconditions as possible. Workers can either simply click the pre-highlighted statements or redo the selection to obtain their desired segments. The clicked or selected statements pop up in the right panel as text blocks. For convenient page-layout management, each text block is draggable and can be moved anywhere within the panel. The workers then use their judgement and common sense to connect text blocks (two at a time) by right-clicking one of them to start a directed linkage (which ends at another text block) and choosing a proper dependency label for that particular drawn linkage.
Since our annotation task can be rather complicated, we would like our workers to fully understand the requirements before proceeding to the actual annotation. All annotators are expected to pass three qualification rounds, each consisting of 5 HITs, before being selected as official annotators. 15 HITs are annotated internally in advance as the standard answers used to judge the quality of the qualification rounds. We calculate the IAAs of each annotator against our standard answers to measure their performance on our task. In each round, only the best performers move on to the next. At the end of each round, we email annotators to answer the questions they asked and explain some of the mistakes commonly shared across multiple workers. In total, over 60 workers participated in our task, and 10 of them passed the qualification rounds. We estimate the time required to complete each of our HITs to be 10-15 minutes, and adjust our pay rate to $2.5 and $3 USD for the qualification and the actual production rounds, respectively. This roughly equates to a $15 to $18 USD hourly wage, which is above the local minimum wage of the workers. We also ensure that each of our data samples in the official rounds is annotated by at least two different qualified workers.
Confidence Levels. We report the averaged percentage of confidence levels reported by the workers in Table 6. Note that the majority of workers indicate moderate or fair confidence levels, implying that they are sufficiently confident about their annotations. We also received feedback that some workers rarely use strong words such as very to indicate their confidence levels, so the resulting statistics of their confidence could be slightly biased towards the medium.
Human Performance. We randomly select 100 samples from the WikiHow annotated-test-set and 50 samples from the Instructables.com annotated-test-set for computing the human performance. The allowed inputs are exactly the same as what the models take, i.e., given the whole instruction paragraph as context and the highlighted text segments of interest (postulated text-segment boxes), workers are asked to predict the relations among such segments so as to induce a complete dependency graph. For each sample, we collect inputs from two different workers, and ensure that these workers are not the ones who gave the original annotations of the action-condition dependencies. The human performance is then computed by taking the averaged metrics, similar to the models, on the given samples.

C Modelling Details
C.1 More on Heuristics

C.1.1 Linking Algorithm
In Section 4.2, we mention that a maximum distance of 2 steps between linked segments is imposed to filter out likely non-dependent conditions. While this can still include many weakly dependent text segments, our goal is to exploit the generalization ability of large-scale pretrained language models to recognize the segments that are most probably conditions, by including as many heuristically proposed linkages as possible, which is empirically proven effective. A better strategy for choosing the maximum allowed step-wise distance is left as future work.
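A minimal sketch of this distance-based filter (representing segments as (step_index, text) pairs is our assumption about the data layout):

```python
MAX_STEP_DISTANCE = 2  # maximum allowed step-wise distance from Section 4.2

def propose_links(segments):
    """segments: list of (step_index, text) pairs. Propose candidate
    condition links only between segments whose steps are at most
    MAX_STEP_DISTANCE apart; the language model later scores which of
    these proposals are real dependencies."""
    links = []
    for i, (step_i, seg_i) in enumerate(segments):
        for j, (step_j, seg_j) in enumerate(segments):
            if i != j and abs(step_i - step_j) <= MAX_STEP_DISTANCE:
                links.append((seg_i, seg_j))
    return links
```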

C.1.2 Keywords
About 3% of the entire un-annotated data has sentences containing the keywords we use in this work (Table 2). Despite the relatively small amount compared to other heuristics, they are quite effective, judging from the results reported in Table 3.
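A rough sketch of how such keyword cues can propose condition labels; note that the cue lists below are hypothetical illustrations, not the actual keywords of Table 2:

```python
# Hypothetical cue phrases for illustration only (see Table 2 for the real list).
PRE_CUES = ("before", "once you have", "make sure")
POST_CUES = ("until", "so that")

def keyword_label(sentence):
    """Return a coarse condition label when a cue phrase appears in the
    sentence, or None when no cue matches."""
    s = sentence.lower()
    if any(cue in s for cue in PRE_CUES):
        return "precondition"
    if any(cue in s for cue in POST_CUES):
        return "postcondition"
    return None
```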

C.1.3 Key Entity Tracing
For the key entity tracing heuristic described in Section 4.1.2, as long as two segments share at least one mentioned entity, they can be linked (i.e., traced by the shared entity). We do not constrain the number of key entities within a segment, so more than one entity can be used to conduct the tracing.
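A minimal sketch of the tracing, assuming each segment comes paired with a list of its mentioned entities (the function names are ours):

```python
def shared_entities(ents_a, ents_b):
    """Entities mentioned in both segments; any overlap permits a link."""
    return set(ents_a) & set(ents_b)

def trace_links(segments):
    """segments: list of (text, entity_list) pairs. Link every pair of
    segments sharing at least one entity; more than one key entity per
    segment may participate in the tracing."""
    links = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            common = shared_entities(segments[i][1], segments[j][1])
            if common:
                links.append((i, j, sorted(common)))
    return links
```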
Constructing Entity Prediction Datasets. As mentioned in Section 4.1.2, one way to postulate the key entities is to construct a predictive model that outputs potentially involved entities.
To do so, we first construct an entity vocabulary by extracting all noun phrases within each SRL-extracted segment of the entire un-annotated-set articles. To avoid obtaining an overly large vocabulary as well as improbable entities, we only retain entities (without lemmatization) that appear with > 5 occurrences in at least one article.
We then train a language model (also based on RoBERTa-large) whose output is a multi-label, multi-class classification over the predicted entities. When predicting the key entities for a given segment, we further constrain the predictions to be within the local vocabulary (more than 5 occurrences) of the article to which that segment belongs. This model is inspired by the entity selector module proposed in (Bosselut et al., 2018), although we only consider single-step statements. We verify the performance of the learned model on the dataset provided by (Bosselut et al., 2018) (the entity selection task), where our model achieves roughly 60% F1, indicating that the trained model is sufficiently reliable.
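The vocabulary construction and the local-vocabulary constraint above can be sketched as follows (noun-phrase extraction itself is omitted, and the helper names are our own):

```python
from collections import Counter

MIN_OCCURRENCES = 5  # retain entities with > 5 occurrences in some article

def build_entity_vocab(articles):
    """articles: mapping article_id -> list of noun phrases extracted from
    its SRL segments (no lemmatization). An entity enters the global
    vocabulary if it occurs more than MIN_OCCURRENCES times in at least
    one single article."""
    vocab, per_article = set(), {}
    for art_id, noun_phrases in articles.items():
        counts = Counter(np.lower() for np in noun_phrases)
        per_article[art_id] = counts
        vocab.update(e for e, c in counts.items() if c > MIN_OCCURRENCES)
    return vocab, per_article

def local_vocab(per_article, art_id, vocab):
    """Restrict predictions to entities frequent within the segment's own
    article, as described above."""
    counts = per_article[art_id]
    return {e for e in vocab if counts.get(e, 0) > MIN_OCCURRENCES}
```

At prediction time, the classifier's candidate set for a segment would be intersected with `local_vocab(...)` for the segment's article.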

C.1.4 Temporal Relations
We use the temporal relation resolution model from (Han et al., 2021b), which is trained on various temporal relation datasets such as MATRES (Ning et al., 2018). We train the model with three different random seeds and take a consensus prediction: unless all of the models jointly predict a specific relation (BEFORE or AFTER), the relation is regarded as VAGUE.
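The consensus rule amounts to a unanimity vote over the seeds, sketched here (the function name is ours):

```python
def consensus_relation(predictions):
    """predictions: relation labels from models trained with different
    random seeds. Only a unanimous BEFORE or AFTER is kept; any
    disagreement, or a unanimous VAGUE, yields VAGUE."""
    first = predictions[0]
    if first in ("BEFORE", "AFTER") and all(p == first for p in predictions):
        return first
    return "VAGUE"
```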

C.2 Development Set Performance
We select the model checkpoints to be evaluated using the held-out development split (annotated-dev-set). We also report the performance on this annotated-dev-set in Table 7.

C.3 More Results on Train-Set Size Varying
Table 8 reports an experiment similar to Table 4, but here we use models that do not utilize the weakly supervised data constructed with the proposed heuristics at all. One can observe that similar trends hold: a plateau can be noticed when the training set size approaches 60%. Compared to Table 4, we can also observe that the smaller the train-set size, the larger the gap between the models with and without the heuristically constructed data. This further implies the effectiveness of our heuristics in constructing meaningful data for the action-condition dependency inference task. The models with heuristics, compared at the same train-set sizes, significantly outperform every model counterpart that does not utilize the heuristics.
Table 9 reports similar experiments on the Instructables.com annotated-test-set. Note that we perform a direct zero-shot transfer from the WikiHow annotated-train-set, so the test-set size is always 100% for Instructables.
Finally, both Tables 10 and 11 report the same experiments, but this time without the second-stage self-training. It is worth noting that the self-training is indeed effective across all train-set sizes and across different datasets and model variants; however, the trend of model performance hitting a saturation point as the train-set size increases still holds.

C.4 Training & Implementation Details
Training Details. The maximum token length of 500 described in Section 6.1 is sufficient for most of the data in the annotated-test-sets, as evident in Table 5.
Implementation Details. The implementations of the transformer-based models are extended from the HuggingFace 10 code base (Wolf et al., 2020), and our entire code base is implemented in PyTorch. 11

C.5 Hyperparameters
We train our models until performance convergence is observed on the heuristically constructed dataset.
The training time for the weakly supervised learning is roughly 6-8 hours. For all finetuning that involves our annotated-sets, we train the models for roughly 10-15 epochs for all model variants, with training times of 1-2 hours. We list all the hyperparameters used in Table 12.
Basic hyperparameters such as the learning rate, batch size, and gradient accumulation steps are kept consistent across all training in this work, including training on the weakly supervised data, finetuning on the annotated-sets, and the second-stage self-training. We also include the search bounds and numbers of trials in Table 13; all of our models adopt the same search bounds and numbers of trials.

D Releases & Codes
The comprehensive human-annotated datasets, on both WikiHow and Instructables.com, will be released upon acceptance, along with clearly stated documentation for their usage. We also plan to release the code (a snippet of our code is included as a .zip file during the reviewing period) for processing the datasets as well as the implementation of our models and proposed training methods. We hope that by sharing these essential resources, our work can stimulate more interest in research on procedural understanding that specifically targets condition-action dependencies and their applications to autonomous task-solving agents and assistive AI that guides humans through accomplishing complex tasks.
10 https://github.com/huggingface/transformers
11 https://pytorch.org/

Figure 1 :
Figure 1: The Action Condition Inference Task: We propose a task that probes models' ability to infer both preconditions and postconditions of an action from instructional manuals. It has wide applications, e.g., to assistive AI and task-solving robots. * Original instructions are rephrased for simplicity in this illustration.

Figure 2 :
Figure 2: Terminologies: (Left) We show a few exemplar actionables with their associated preconditions and postconditions. Certain linguistic hints (e.g., SRL tags) are utilized to propose plausible (and likely) postcondition text segments.

Figure 3 :
Figure 3: Model architectures: (a) Non-contextualized pairwise model: The model only considers a pair of given text segments. (b) Contextualized model: The model takes the whole instruction paragraphs (i.e., contexts) and wraps each text segment with our special tokens (<a>), where each segment representation is obtained by averaging its token representations. The ordered, concatenated segment representations are then fed into an MLP to make the final predictions.

Figure 4 :
Figure 4: MTurk Annotation User Interface: (a) We ask workers to follow the indicated instruction. All the blue-colored text bars on the top of the page are expandable. Workers can click to expand them for detailed instructions of the annotation task. (b) The annotation task is designed for an intuitive click/select-then-link usage, followed by a few additional questions such as confidence level and feedback (this example is obtained from the WikiHow dataset).

Table 3 :
Annotated-test-set performance: The best performance is achieved by applying all of the proposed heuristics and undergoing the two-stage training: finetuning on the annotated-train-set first and then performing the self-training. We also report ablation studies on the designed heuristics, where *- indicates exclusion. Note that for Instructables.com, both the Finetuned and the Self-training models are trained on the WikiHow training set and a zero-shot transfer is performed.

Table 4 :
Varying annotated-train-set size: on WikiHow (test-set size is fixed at 30%).We use the (best) model trained with all the proposed heuristics and the self-training paradigm.

Table 5 :
General statistics of the two annotated-sets: We provide the detailed component counts of the annotated-sets used in this work, including the statistics of tokens and sentences from the instruction steps (lower half).

Table 7 :
Annotated-dev-set performance on WikiHow: Similar to Table 3, we report the development set performance on the WikiHow dataset (Instructables.com does not have a development set as we conduct a zero-shot transfer).

Table 8 :
Varying annotated-train-set size without weakly supervised training: on WikiHow (test-set size is fixed at 30%). The model used in this experiment is trained without any of the heuristically constructed data, but we apply the self-training paradigm.

Table 9 :
Varying annotated-train-set size: on Instructables.com (test-set size is fixed at 100%). Note that here the train-set size refers to the WikiHow annotated-set, and the 30% setting is basically Table 3. The upper half is with models that utilize both the heuristically constructed dataset and the self-training paradigm, while the lower half is with models that do not use any weak supervision.

All the models in this work are trained on a single Nvidia A100 GPU 9 on a Ubuntu 20.04.2 operating system.
9 https://www.nvidia.com/en-us/data-center/a100/

Table 10 :
Varying annotated-train-set size: on WikiHow (test-set size is fixed at 30%). The upper half is with models that utilize the heuristically constructed dataset, while the lower half is with models that do not use any weak supervision. Neither half undergoes the second-stage self-training.

Table 11 :
Varying annotated-train-set size: on Instructables.com (test-set size is fixed at 100%). The structure of this table is similar to that of Table 10, i.e., no self-training is conducted.
The hyperparameters for each model are manually tuned against different datasets, and the checkpoints used for testing are selected as the best-performing ones on the held-out development sets of their respective datasets.

Table 12 :
Hyperparameters used in this work: Initial LR denotes the initial learning rate. All the models are trained with the Adam optimizer (Kingma and Ba, 2015). We include the number of learnable parameters of each model in the # params column.

Table 13 :
Search bounds for the hyperparameters of all the models.