Constructing Procedural Graphs with Multiple Dependency Relations: A New Dataset and Baseline

Current structured and semi-structured knowledge bases mainly focus on representing descriptive knowledge but ignore another type of commonsense knowledge: procedural knowledge. To structure procedural knowledge, existing methods have been proposed to automatically generate flow graphs from procedural documents. They focus on extracting the sequential dependency between sentences but neglect two other important dependencies (i.e., inclusion dependency and constraint dependency) in procedural documents. In this paper, we explore the problem of automatically generating procedural graphs with multiple dependency relations to extend the flow graphs constructed by existing methods, and propose a procedural graph construction method that exploits syntactic information and discourse structures. A new dataset (WHPG) is built, and extensive experiments are conducted to evaluate the effectiveness of our proposed model.


Introduction
Many well-known structured knowledge bases (e.g., Wikidata) and semi-structured knowledge bases (e.g., Wikipedia) have been built and help many applications achieve remarkable performance, such as question answering (QA) (Li and Moens, 2022), information retrieval (Zhou et al., 2022) and recommendation systems (Cui and Lee, 2022). They focus on representing descriptive knowledge, i.e., the knowledge of attributes or features of things (Yang and Nyberg, 2015), but lack another kind of commonsense knowledge: procedural knowledge. Specifically, knowledge in the form of procedures or sequences of actions to achieve particular goals is called procedural knowledge, such as cooking recipes and maintenance manuals.
Generally, most procedural knowledge is expressed in unstructured texts (e.g., websites or books of cooking recipes). To extract structured procedural knowledge, existing methods (Honkisz et al., 2018; Qian et al., 2020; Pal et al., 2021) are designed to transform unstructured procedural documents into flow graphs (or workflows), which can effectively present the main operations and their ordering relations expressed in procedural documents. However, they only focus on extracting the sequential dependency (i.e., the dependency relation "Next" in Figure 1) between steps (operational sentences) in a procedural document, which is insufficient in real-world scenarios. As shown in Figure 1, sentences S2 and S3 are sub-actions of sentence S1, which provide more fine-grained operational statements to finish operation S1. There is thus another kind of dependency, the inclusion dependency, between sentences S1 and S2 (or between S1 and S3). Nevertheless, the flow graphs constructed by current methods (Qian et al., 2020; Pal et al., 2021) ignore the inclusion dependencies among sentences and wrongly connect sentences S1 and S2 with a "Next" relation, as shown in Figure 1. Furthermore, declarative (or descriptive) sentences commonly appear in real-world procedural documents, stating the constraints (e.g., reasons, conditions and effects) of doing things. Prior research has shown that declarative sentences in procedural documents can provide important clues for procedural semantic understanding and reasoning (Georgeff and Lansky, 1986) in many downstream tasks such as operation diagnosis (Luo et al., 2021) and technical maintenance (Hoffmann et al., 2022). However, current knowledge structuring methods (Qian et al., 2020; Pal et al., 2021) simply transform declarative sentences into an information flow in a flow graph (e.g., S7 → S8 in Figure 1), which neglects the constraint dependency between operational and declarative sentences. As shown in Figure 1, the declarative sentences S7 and S8 respectively describe the effect constraint and condition constraint for the execution of operational sentence S6.
Based on the above motivations, we explore the problem of automatically constructing a procedural graph with multiple dependency relations between sentences in a procedural document. According to our observation, the syntactic structures of sentences can provide obvious features for identifying sentence types and can then help detect the dependency relations between sentences. As shown in Figure 2, the syntactic pattern "verb (VB) -obj-> noun (NN)" is a strong indicator for classifying sentences S2 and S3 as operation types. Meanwhile, sentence type prediction can further benefit dependency relation detection between sentences. For example, a constraint dependency cannot exist between two sentences that both have an operation type. Moreover, inspired by research in discourse parsing (Zhu et al., 2019; Wang et al., 2021), we observe that the contextual dependency structures (referred to as discourse structures) can provide features to recognize the dependency relations between sentences. As shown in Figure 1, the dependency relation between S1 and S3 can be inferred as Sub-Action according to their contextual dependency structure. In this paper, we design a procedural graph construction method to detect multiple dependency relations between sentences in procedural documents by utilizing syntactic information and discourse structures. Specifically, a GCN-based syntactic structure encoder with multi-query attention is proposed to capture the syntactic features of sentences and improve the ability to distinguish between operational and declarative sentences. Moreover, a structure-aware edge encoder is designed to assist the inference of dependencies between sentences by infusing the contextual structure features of procedural documents. Furthermore, due to the lack of annotated dependencies between sentences in existing procedural text datasets, a new dataset, WHPG, is built based on the wikiHow database.
To summarize, the main contributions of this paper are as follows: • We explore the problem of automatically generating procedural graphs with multiple dependencies from unstructured procedural documents, aiming to extend the flow graphs of existing methods that ignore dependencies between sentences. To the best of our knowledge, our work is the first study focusing on generating procedural graphs with multiple dependencies from procedural documents.
• We design a GCN-based syntactic structure encoder and a discourse-structure aware edge encoder to effectively identify node types and assist the detection of dependency relations in procedural documents.
• We create a new procedural text dataset, WHPG, which annotates dependency relations between operational and declarative sentences.
Extensive experiments are conducted on two public datasets from different domains and on our created dataset to evaluate the effectiveness of our model in automatically generating procedural graphs.

Related Work
Many prominent knowledge bases such as Wikidata (Vrandečić and Krötzsch, 2014), Wikipedia (Lehmann et al., 2015) and Freebase (Bollacker et al., 2007) mainly focus on representing descriptive knowledge (i.e., the knowledge of attributes or features of things (Yang and Nyberg, 2015)). But they do not sufficiently cover procedural knowledge (i.e., the knowledge of procedures or sequences of actions for achieving particular goals (Georgeff and Lansky, 1986)). Recently, to obtain structured procedural knowledge, two categories of methods (i.e., entity-level and sentence-level methods) have been proposed. Specifically, entity-level methods (Jermsurawong and Habash, 2015; Feng et al., 2018; Mysore et al., 2019; Qian et al., 2020; Xu et al., 2020; Yamakata et al., 2020; Jiang et al., 2020; Fang et al., 2022) aim to extract predefined entities and their relations from unstructured procedural texts (e.g., cooking recipes). However, they require large-scale fine-grained annotated data for each domain and lack domain generalization ability.
To alleviate these issues, sentence-level methods (Pal et al., 2021) are designed to construct flow graphs at the sentence level for procedural documents. However, they only focus on extracting action flows with sequential dependencies, which is limiting in real-world scenarios. In practice, both the inclusion dependency and the constraint dependency are common in procedural texts and benefit procedural text understanding and reasoning (Georgeff and Lansky, 1986; Hoffmann et al., 2022). Thus, this paper explores the problem of automatically generating a procedural graph with multiple dependency relations from a procedural document.
To date, several public entity-level procedural text datasets (Yamakata et al., 2020; Qian et al., 2020; Mysore et al., 2019; Mishra et al., 2018) and a sentence-level dataset, CTFW (which is not publicly available due to ethical considerations) (Pal et al., 2021), have been built. Nevertheless, existing public datasets do not annotate the dependency relations between sentences in a procedural document. Thus, a new dataset, WHPG, based on the wikiHow knowledge base is built and will be made publicly available for evaluation in future research.

Problem Definition and Notations
The goal of our task is to extract the dependency relations between sentences and construct a procedural graph for each procedural document. Specifically, given a procedural document D = {s_1, s_2, ..., s_N}, a procedural graph G_D = {D, Ψ, R} is constructed with nodes (i.e., sentences) s_i ∈ D and triplets (s_i, r_{i,j}, s_j) ∈ Ψ, where r_{i,j} ∈ R denotes the dependency relation between sentences s_i and s_j, and N is the number of sentences in the procedural document D. Note that the dependency relation set R contains four kinds of dependency relations: Next-Action, Sub-Action, Constraint and None, as shown in Figure 1.
Constructing the procedural graphs requires three subtasks: Node Type Classification, Edge Prediction and Dependency Relation Classification. In the Node Type Classification task, each node s_i ∈ D is classified into one of the node types (i.e., Operation, Declaration, Both and None). Then, the extraction of triplets (s_i, r_{i,j}, s_j) from the procedural document D is divided into two tasks: Edge Prediction P(s_i → s_j | (s_1, s_2, ..., s_N)), which predicts whether an edge exists for each sentence pair s_i and s_j; and Dependency Relation Classification P(r_{i,j} | s_i → s_j), which classifies each sentence pair (predicted as an edge in the Edge Prediction task) into one of the dependency relations (i.e., Next-Action, Sub-Action and Constraint).
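The task decomposition above can be sketched as a minimal data model; the class and method names below are our own illustration, not part of the paper:

```python
from dataclasses import dataclass, field

NODE_TYPES = ("Operation", "Declaration", "Both", "None")
RELATIONS = ("Next-Action", "Sub-Action", "Constraint")

@dataclass
class ProceduralGraph:
    sentences: list                                   # D = [s_1, ..., s_N]
    node_types: dict = field(default_factory=dict)    # node index -> node type
    triplets: set = field(default_factory=set)        # (i, r, j), r in RELATIONS

    def add_node_type(self, i, t):
        assert t in NODE_TYPES
        self.node_types[i] = t

    def add_edge(self, i, r, j):
        # edges point forward in the document: i < j
        assert i < j and r in RELATIONS
        self.triplets.add((i, r, j))

g = ProceduralGraph(["Cut a hole.", "Use a cup.", "Then dry it."])
g.add_node_type(0, "Operation")
g.add_edge(0, "Sub-Action", 1)
g.add_edge(1, "Next-Action", 2)
```

Node type classification fills `node_types`, while edge prediction and relation classification jointly fill `triplets`.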

Syntactic Graph Construction
The part-of-speech tags and syntactic structure of sentences can provide evident clues for facilitating the inference of node types and dependency relations. For each sentence, we use the Stanford CoreNLP library (Manning et al., 2014) to recognize the part-of-speech of each token and the dependency relations among tokens, as shown in Figure 3. Thus, given a sentence s_i = {x_{i,1}, x_{i,2}, ..., x_{i,n}}, a syntactic relational graph is created as follows:

G^syn_i = {P_i, R^syn_i, Φ_i}    (1)

where P_i represents the set of part-of-speech tags in sentence s_i; R^syn_i is the set of syntactic dependency relations; and Φ_i denotes the set of triplets (p_{i,j}, r^syn_{j,k}, p_{i,k}) with the part-of-speech p_{i,j} ∈ P_i of token x_{i,j} and the syntactic dependency relation r^syn_{j,k} ∈ R^syn_i between p_{i,j} and p_{i,k}.
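A minimal sketch of this construction, assuming the part-of-speech tags and token-level dependency edges have already been produced by a parser such as CoreNLP (the function and variable names are illustrative):

```python
def build_syntactic_graph(pos_tags, dep_edges):
    """Build the syntactic relational graph G^syn = {P, R^syn, Phi}.

    pos_tags:  list of part-of-speech tags, one per token (assumed
               precomputed by a dependency parser).
    dep_edges: list of (head_idx, relation, dep_idx) token-level edges.
    """
    P = set(pos_tags)
    R_syn = {r for _, r, _ in dep_edges}
    # Phi: triplets over part-of-speech tags linked by dependency relations
    Phi = {(pos_tags[h], r, pos_tags[d]) for h, r, d in dep_edges}
    return P, R_syn, Phi

# "Cut a small hole": VB(cut) -obj-> NN(hole), NN(hole) -det-> DT(a), ...
pos = ["VB", "DT", "JJ", "NN"]
deps = [(0, "obj", 3), (3, "det", 1), (3, "amod", 2)]
P, R_syn, Phi = build_syntactic_graph(pos, deps)
```

The resulting triplet ("VB", "obj", "NN") is exactly the operation-indicating pattern discussed in the introduction.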

Syntactic RGCN Encoder
Each type of part-of-speech p ∈ P and each dependency relation r ∈ R^syn are initialized as a learnable vector p ∈ R^{d_r} and a learnable weight matrix W_r ∈ R^{d_r×d_r}, respectively. Then, given a syntactic graph G^syn_i for sentence s_i, each part-of-speech node p_{i,j} is encoded by a Relational Graph Convolutional Network (RGCN) (Schlichtkrull et al., 2018) encoder as follows:

h^{(l+1)}_{i,j} = ReLU( W^{(l)}_0 h^{(l)}_{i,j} + Σ_{r ∈ R^syn_i} Σ_{k ∈ N^r_j} (1/|N^r_j|) W^{(l)}_r h^{(l)}_{i,k} )    (2)

where N^r_j is the set of neighborhood nodes of node p_{i,j} under relation r ∈ R^syn_i; W_0 denotes the learnable self-loop parameters; and l is the layer index of the RGCN encoder.
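A toy NumPy sketch of one RGCN layer as used here, with mean-normalized per-relation messages plus a self-loop term (dimensions and the ReLU non-linearity are illustrative):

```python
import numpy as np

def rgcn_layer(H, edges, W_rel, W0):
    """One RGCN layer (Schlichtkrull et al., 2018) as a toy NumPy sketch.

    H:      (num_nodes, d) node features h^{(l)}
    edges:  list of (src, rel, dst); messages flow src -> dst
    W_rel:  dict rel -> (d, d) relation-specific weight W_r
    W0:     (d, d) self-loop weight
    """
    out = H @ W0.T                                # self-loop term W_0 h_j
    per_rel = {}
    for s, r, t in edges:
        per_rel.setdefault((t, r), []).append(H[s] @ W_rel[r].T)
    msg = np.zeros_like(out)
    for (t, r), ms in per_rel.items():
        msg[t] += np.mean(ms, axis=0)             # (1/|N_j^r|) sum of W_r h_k
    return np.maximum(out + msg, 0.0)             # ReLU

rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))                       # three POS nodes
W_rel = {"obj": rng.normal(size=(4, 4)), "det": rng.normal(size=(4, 4))}
H1 = rgcn_layer(H, [(0, "obj", 1), (2, "det", 1)], W_rel, np.eye(4))
```

Stacking such layers propagates syntactic features along the dependency edges of the graph.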

Multi-Query Syntactic-Aware Attention
Moreover, not all syntactic features are equally important for identifying node types and the dependency relations between nodes. For the example shown in Figure 3, the syntactic pattern "VB -obj-> NN" is a strong indicator for classifying nodes as "Operation" types, while the pattern "determiner (DT) -det-> noun (NN)" does not provide explicit features for node type classification. Meanwhile, different tasks also focus on different syntactic features. Motivated by this, a multi-query syntactic-aware attention module is designed to enable the model to attend to the syntactic features relevant to the target tasks. Specifically, given a sentence s_i, the syntactic feature representation is obtained as follows:

α_{k,j} = softmax_j( q_k^T h_{i,j} ),    v^syn_i = (1/N_q) Σ_{k=1}^{N_q} Σ_{j=1}^{n} α_{k,j} h_{i,j}    (3)

where n is the number of tokens in sentence s_i; N_q is the number of queries; and q_k ∈ R^{d_r} denotes a learnable query vector.
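One plausible instantiation of such a pooling, sketched in NumPy under the assumption that each learnable query attends over the token-level syntactic features and the per-query summaries are averaged:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_query_pool(H_syn, Q):
    """H_syn: (n, d) token-level syntactic features for one sentence.
    Q: (N_q, d) learnable query vectors q_k.
    Returns v^syn: (d,) sentence-level syntactic representation."""
    summaries = []
    for q in Q:
        alpha = softmax(H_syn @ q)        # attention of query k over tokens
        summaries.append(alpha @ H_syn)   # weighted sum of token features
    return np.mean(summaries, axis=0)     # average the N_q query views

rng = np.random.default_rng(1)
H_syn = rng.normal(size=(5, 8))           # 5 tokens, d = 8
Q = rng.normal(size=(3, 8))               # N_q = 3 queries
v_syn = multi_query_pool(H_syn, Q)
```

With multiple queries, different queries can specialize, e.g., one toward operation-indicating verb patterns and another toward connectives.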

BERT-based Bi-GRU Encoder
For a sentence s_i, each token x_{i,j} is first encoded into a numerical vector v_{i,j} ∈ R^{d_bert} by the pre-trained language model BERT (Devlin et al., 2019). Then, a hierarchical GRU encoder consisting of two bidirectional GRUs (Bi-GRU) is utilized to learn the contextual features. Specifically, given the sequence of embedding vectors E_i = {v_{i,1}, v_{i,2}, ..., v_{i,n}} of sentence s_i, the first Bi-GRU encodes it as h_i ∈ R^{d_gru} by concatenating the last hidden states from the two directions. In this way, each sentence s_i is encoded into a vector h_i, and the procedural document D is encoded as a sequence of vectors {h_1, h_2, ..., h_N}, where N is the number of sentences in the procedural document. Moreover, to capture the global contextual features, the second Bi-GRU encoder transforms {h_1, h_2, ..., h_N} into {v^gru_1, v^gru_2, ..., v^gru_N}, where v^gru_i ∈ R^{d_gru} denotes the feature representation of sentence s_i ∈ D.
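A simplified NumPy sketch of the first Bi-GRU step (toy dimensions and random weights; in the model the inputs would be BERT token vectors):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_last_state(X, Wz, Wr, Wh):
    """Run a single-direction GRU over X (seq_len, d_in) and return the
    last hidden state; each weight acts on the concatenation [h; x]."""
    h = np.zeros(Wz.shape[0])
    for x in X:
        hx = np.concatenate([h, x])
        z = sigmoid(Wz @ hx)                      # update gate
        r = sigmoid(Wr @ hx)                      # reset gate
        h_new = np.tanh(Wh @ np.concatenate([r * h, x]))
        h = (1 - z) * h + z * h_new
    return h

def bigru_sentence_vector(X, fwd, bwd):
    """h_i: concatenation of the last hidden states of the forward and
    backward passes over one sentence's token embeddings."""
    return np.concatenate([gru_last_state(X, *fwd),
                           gru_last_state(X[::-1], *bwd)])

rng = np.random.default_rng(2)
d_in, d_h = 6, 4
make_weights = lambda: tuple(rng.normal(scale=0.1, size=(d_h, d_h + d_in))
                             for _ in range(3))
tokens = rng.normal(size=(5, d_in))               # stand-in for BERT vectors
h_i = bigru_sentence_vector(tokens, make_weights(), make_weights())
```

The second Bi-GRU applies the same mechanism over the sequence of sentence vectors h_1, ..., h_N rather than over tokens.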

Feature Fusion
The representation of each sentence s_i ∈ D is obtained by concatenating the syntactic feature representation v^syn_i and the semantic feature representation v^gru_i as follows:

v_i = [v^syn_i ; v^gru_i]    (4)

where [·;·] denotes the concatenation of the given two vectors.

Structural-Aware Edge Feature Representation
The contextual dependency structures in a procedural document have been proven effective in discourse parsing (Shi and Huang, 2019; Wang et al., 2021). A structure-aware attention is therefore designed to capture the contextual structure features of each target sentence pair for both edge prediction and relation classification. Specifically, given a node pair (s_i, s_j), the edge representation r^init_{i,j} is initialized by concatenating the syntactic feature representations v^syn_i and v^syn_j and the distance embedding v^dist_{i,j}:

r^init_{i,j} = [v^syn_i ; v^syn_j ; v^dist_{i,j}]    (5)

where i < j, j − i < win, and win is the longest distance between two nodes in a procedural document. Then, we update the node representation v_i of Equation (4) with contextual features through a structure-aware attention over the nodes and edge features, where W_Q, W_F, W_K, W_V and W_R are learnable parameters and d_r + d_gru is the dimension of the node representations. Finally, the edge representation r_{i,j} is calculated by re-fusing the node features through a gated update, where W_r, W_z and W_h are learnable parameters and ⊙ denotes the element-wise (dot) product.
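A NumPy sketch of the edge initialization and a relation-aware attention update consistent with the description above; the exact score and value formulations (biasing keys and values with projected edge features) are our assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def init_edge(v_syn_i, v_syn_j, v_dist):
    # r^init_{i,j}: concatenation of the two syntactic vectors and the
    # distance embedding, as described in the text
    return np.concatenate([v_syn_i, v_syn_j, v_dist])

def structure_aware_update(V, R_init, WQ, WK, WV, WR):
    """Update every node vector with context, biasing attention keys and
    values by the edge features (a relation-aware attention sketch).
    V: (N, d) node vectors; R_init: dict (i, j) -> edge init vector."""
    N, d = V.shape
    out = np.zeros_like(V)
    for i in range(N):
        scores, vals = [], []
        for j in range(N):
            e = WR @ R_init[(i, j)]               # project edge features
            scores.append((WQ @ V[i]) @ (WK @ V[j] + e) / np.sqrt(d))
            vals.append(WV @ V[j] + e)
        alpha = softmax(np.array(scores))
        out[i] = alpha @ np.array(vals)
    return out

rng = np.random.default_rng(3)
N, d = 3, 4
syn = rng.normal(size=(N, 2))
dist_emb = rng.normal(size=(N, 1))                # toy distance embeddings
V = rng.normal(size=(N, d))
R_init = {(i, j): init_edge(syn[i], syn[j], dist_emb[abs(j - i)])
          for i in range(N) for j in range(N)}    # edge dim d_e = 5
mk = lambda rows, cols: rng.normal(scale=0.3, size=(rows, cols))
V_ctx = structure_aware_update(V, R_init, mk(d, d), mk(d, d), mk(d, d), mk(d, 5))
```

The updated node vectors can then be gated together with r^init_{i,j} to produce the final edge representation.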

Projection and Loss Function
The representations of node s_i and edge r_{i,j} are encoded by Equation (4) and Equation (7) as v_i and r_{i,j}, respectively. We adopt a projection layer with a softmax function to calculate the probability distribution over categories for each task (i.e., node types, edge existence and dependency relations).

Experiment
We first introduce the construction of the new dataset WHPG and then analyze the experimental results in detail.

Dataset Collection & Annotation
We build the original corpus from the online wikiHow knowledge base (Anthonio et al., 2020), which provides a collection of how-to articles on various topics (e.g., entertainment and crafts).
We exploit the wikiHow knowledge base to create WHPG, a dataset of procedural texts with dependency-relation annotations among operational and declarative sentences. The online wikiHow knowledge base provides an Export pages service, which allows exporting the texts of wikiHow articles (Anthonio et al., 2020). We adopt the Python library urllib to request the Export pages service and crawl procedural articles. Given the candidate set of procedural documents, we filter out unnecessary information (e.g., writing dates, citations and URLs). Procedural documents containing only one step are also filtered out. Finally, three parts (i.e., the title, method name and steps of each procedural document) are kept to form a complete instance. In total, we obtain a candidate set of 330 procedural documents from the Crafts topic.
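The filtering steps can be sketched as follows; the regular expressions and field names are illustrative assumptions, not the paper's actual pipeline:

```python
import re

def clean_document(raw_steps, title, method_name):
    """Filter a crawled article down to the three parts that are kept:
    title, method name and steps (illustrative heuristics)."""
    steps = []
    for s in raw_steps:
        s = re.sub(r"https?://\S+", "", s)        # drop URLs
        s = re.sub(r"\[\d+\]", "", s).strip()     # drop citation markers like [1]
        if s:
            steps.append(s)
    if len(steps) <= 1:                           # drop one-step documents
        return None
    return {"title": title, "method": method_name, "steps": steps}

doc = clean_document(
    ["Cut a hole. [1]", "Attach it to https://example.com", "Let it dry."],
    "How to Make a Bird Feeder", "Using a Cup")
```

Documents that survive cleaning form the candidate set passed on to annotation.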
As shown in Figure 1, for each procedural document, we provide three kinds of annotations: sentence type (i.e., "Operation", "Declaration", "Both" or "None"), edge (i.e., the connection between two sentences) and dependency relation (i.e., "Next-Action", "Sub-Action" or "Constraint"). Three well-educated annotators are employed to annotate the candidate procedural documents, which are evenly divided among them, using the BRAT tool. To ensure annotation quality, each annotator is required to give a confidence score for each annotated label.
We weight the confidence scores of the annotators for the same label, and the label with the highest score is preserved. Moreover, the annotation samples with the lowest confidence scores are discussed collectively to determine the final annotation results, and labeled instances on which a consensus is difficult to reach are discarded. Finally, two well-trained annotators recheck all the annotation results to further ensure annotation quality. The final dataset contains 283 procedural documents with about 7,341 edges. A statistical comparison of WHPG with existing sentence-level procedural text datasets is shown in Table 1. Moreover, the statistics of sentence (node) types and dependency relations of our created dataset WHPG are shown in Table 2. We also show the label distributions of the node types and dependency relations in Table 3 and Table 4.
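A simplified reading of this aggregation scheme, where per-annotator confidence scores are summed per label and the highest-scoring label is kept (the exact weighting is our assumption):

```python
def aggregate_labels(annotations):
    """annotations: list of (label, confidence) pairs, one per annotator.
    Sum the confidence scores per label and keep the best label."""
    totals = {}
    for label, conf in annotations:
        totals[label] = totals.get(label, 0.0) + conf
    best = max(totals, key=totals.get)
    return best, totals[best]

label, score = aggregate_labels(
    [("Next-Action", 0.9), ("Sub-Action", 0.6), ("Next-Action", 0.7)])
```

Low-scoring samples would then be routed to group discussion rather than accepted automatically.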

Datasets & Experimental Settings
We conduct extensive experiments on our annotated dataset WHPG. Following Pal et al. (2021), we split the WHPG dataset into train, validation and test sets with a 7:1:2 ratio. Furthermore, two public datasets (i.e., COR (Yamakata et al., 2020) and MAM (Qian et al., 2020)), which do not consider dependency relations among sentences, are also utilized to conduct comparative experiments on the Edge Prediction task.
For the Node Type Classification task, we use accuracy as the evaluation metric. Considering the label imbalance in the Edge Prediction task, the F1-score of the positive class (i.e., sentence pairs connected by an edge) is used as the evaluation metric for Edge Prediction. The performance of the Dependency Relation Classification task is affected by the preceding Edge Prediction stage. Thus, we evaluate them jointly with the F1-score metric (i.e., Edge&Rel in Table 5).
In the edge prediction and dependency relation classification tasks, each sentence in the procedural document is paired with each of the following sentences to determine whether there is an edge and which type of dependency relation they have. To evaluate the generalization ability, four experimental settings (i.e., win = 5, win = 10, win = 20 and ALL) are used to evaluate the effectiveness of our proposed model. For example, under the win = 5 setting, for the first sentence s_1 ∈ D, five candidate sentence pairs, i.e., {(s_1, s_2), (s_1, s_3), (s_1, s_4), (s_1, s_5), (s_1, s_6)}, are examined by the model to predict whether there are edges and which type of dependency relation each pair belongs to. In the training stage, we use the AdamW optimizer with a batch size of 4, a learning rate of 2e-5 and a dropout rate of 0.4.
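The window-based pair enumeration described above can be sketched as:

```python
def candidate_pairs(num_sentences, win):
    """Enumerate the forward sentence pairs (s_i, s_j) examined by the
    model under a window-size setting: i < j and j - i <= win."""
    if win is None:                       # the "ALL" setting
        win = num_sentences
    return [(i, j)
            for i in range(1, num_sentences + 1)
            for j in range(i + 1, min(i + win, num_sentences) + 1)]

# win = 5: sentence s_1 of a long document yields five candidate pairs,
# matching the example in the text
pairs_for_s1 = [p for p in candidate_pairs(10, 5) if p[0] == 1]
```

Larger windows trade longer-range coverage for a quadratic growth in the number of candidate pairs.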

Result Analysis
To evaluate the effectiveness of our proposed model on the three tasks (i.e., Node Type Classification, Edge Prediction and Dependency Relation Classification), we compare its performance with 7 recent related works (Pal et al., 2021; Zhou and Feng, 2022) that focus on constructing flow graphs from procedural documents, as shown in Table 5. To explore the problem of generating procedural graphs from procedural documents, the new dataset WHPG is used to perform comparative experiments on both the edge prediction and dependency relation classification tasks. Moreover, two public datasets (i.e., COR (Yamakata et al., 2020) and MAM (Qian et al., 2020)) from different domains are used to conduct comparative experiments. Since these two public datasets ignore the dependencies between sentences, we can only perform experiments on the Edge Prediction task with them.

Node Type Classification
As shown in Table 6, five baselines (Pal et al., 2021; Zhou and Feng, 2022) are used to perform the comparative experiments. In contrast to these baselines, our model embeds syntactic structures into the node feature representations. The experimental results show that our model achieves higher accuracies than current related works on both the validation and test sets, which demonstrates that syntactic structure features can be effectively captured and further improve the ability to distinguish between operational and declarative sentences.

Edge & Relation Classification
Table 5 shows the comparative experimental results under four window-size settings (i.e., 5, 10, 20 and ALL) on both the edge prediction and dependency relation classification tasks, where our proposed model outperforms the baselines. Furthermore, to evaluate the domain generalization ability of our model, two public datasets (i.e., COR in the recipe domain and MAM in the maintenance domain) are used to conduct comparative experiments. As shown in Table 7, our model outperforms all related works by a large margin on both datasets. The experimental results demonstrate that our proposed model can effectively identify sequential dependencies between sentences and has better domain generalization ability.

Analysis for Each Dependency Relation
Figure 4 shows comparative experiments on extracting each type of dependency relation between sentences. Compared with the related works, our model obtains the highest F1 scores on all dependency relation types. Note that due to the imbalance between the numbers of existing and non-existing edges between sentences in procedural documents, existing methods are prone to recognizing sentence pairs as having the None dependency and achieve low performance on the three dependency relations (i.e., Next-Action, Sub-Action and Constraint).

Ablation Study
As shown in Table 5, ablation experiments are conducted to evaluate the effectiveness of the designed modules (i.e., SynEncoder, MultiQAtt and SAtt). The ablation results show that both the syntactic information and the discourse structure benefit dependency relation detection. Specifically, both the SynEncoder and MultiQAtt modules effectively capture syntactic features and assist dependency relation detection. Moreover, the performance of our model improves markedly when the discourse-structure features are embedded by the structure-aware attention module.

Visualization
As shown in Figure 5, we show the weight distributions over words measured by the Multi-Query Syntactic-Aware Attention module. We can observe that phrases matching the syntactic pattern "VB -obj-> NN" (e.g., "cut → hole" in S1 and "use → cup" in S2) obtain higher weight values than other words, which indicates that the sentences are operational sentences. Moreover, the token "Then" in S3 of Figure 5 receives the highest weight value, which indicates that the sentence has a "Next-Action" dependency relation with the previous sentences. This visualization analysis demonstrates the effectiveness of our proposed Multi-Query Syntactic-Aware Attention module.

Conclusion
In this paper, we explore the problem of automatically generating procedural graphs with multiple dependency relations for procedural documents. Existing procedural knowledge structuring methods mainly focus on constructing action flows with the sequential dependency from procedural texts but neglect two other important dependencies, the inclusion dependency and the constraint dependency, which are helpful for procedural text understanding and reasoning. To solve this problem, we build a new procedural text dataset with multiple dependency relations and propose a procedural graph construction method that utilizes syntactic and discourse structure features. Extensive experiments are conducted and demonstrate the effectiveness of our proposed model.

Limitations
In this section, we discuss the limitations of our proposed model. Our model mainly focuses on sentence-level procedural graph construction, so the scenario in which two actions occur in the same sentence cannot be handled. Handling such multi-grained (i.e., entity-level and sentence-level) dependencies between actions is challenging, and we leave this limitation for future work.

Figure 1: An Example of a Flow Graph (Pal et al., 2021) and a Procedural Graph (Ours) with Multiple Dependency Relations.

Figure 3: The Overview of Our Proposed Model with Syntactic and Document Structures.
(The category sets are {Existing, Non-Existing} for the edge prediction task and {Next-Action, Sub-Action, Constraint, None} for the dependency relation classification task.) Given the training dataset M, the model is trained with the following training objective:

L(M, θ) = Σ_{D ∈ M} ( L_t(D; θ) + L_e(D; θ) + L_r(D; θ) )    (8)

where L_t(D; θ), L_e(D; θ) and L_r(D; θ) are the cross-entropy loss functions for the node type classification, edge prediction and relation classification tasks, and D is a procedural document from the training dataset M.

Figure 4: The Experimental Results (F1 scores) of Each Relation on the WHPG Dataset. The results are reported when both the edge prediction and relation classification tasks are correctly predicted at the same time.

Figure 5: The heatmap of the weight distributions for each word measured by Multi-Query Syntactic-Aware Attention.

Table 2: The Size of Sentence (Node) Types and Dependency Relations in WHPG.

Table 4: Label Distributions of Dependency Relation Classification.

Table 5: The Experimental Results (F1 scores) on the WHPG dataset. Edge denotes the performance of the edge prediction task only, and Edge&Rel, which is the main metric, denotes the performance when both the edge prediction and dependency relation classification tasks are correctly predicted at the same time. SynEncoder denotes the Syntactic RGCN Encoder; MultiQAtt denotes the Multi-Query Syntactic-Aware Attention module; SAtt denotes the Structural-Aware Attention module.

Table 6: Experimental Results (Mean Accuracy (%) over three seed values) of the Node Type Classification Task on Both the Validation and Test Sets of the WHPG Dataset.

Table 7: The Experimental Results (F1 scores) of the Edge Prediction Task on the COR and MAM Datasets.