Summarizing Procedural Text: Data and Approach



Introduction
Procedural texts, e.g., scientific articles, instruction books, or recipes, are widespread and useful in many real-world applications (Tang et al., 2020; Gupta and Durrett, 2019a; Du et al., 2019b). In the field of procedural text modeling, much research focuses on entity state tracking (Gupta and Durrett, 2019a,b; Dalvi et al., 2018; Swarup et al., 2020) and reasoning (Tandon et al., 2018), while how to summarize procedural text has not been fully explored. Since a procedural text contains many steps and the procedure is usually long, summarizing it can save time for readers who want to quickly locate a useful step or get an overview of the whole text. In this paper, we propose a new summarization task: Procedural Text Summarization. Intuitively, there are two sub-tasks (shown in Figure 1): (1) summarize each procedure (Step-View); (2) summarize all the procedures into one comprehensive summary (Global-View). The first sub-task aims to summarize the main action in a procedure by incorporating contextual information from related procedures, and the second aims to capture the salient steps from all procedures by leveraging the structure (a.k.a. relationships) of the procedures.

* Equal contribution.
In each procedure, the main content is the description of actions, e.g., cut the whole potato into slices, or heat the iron from room temperature to 1500 degrees. These actions usually cause state changes of entities, e.g., the entity iron changes from the room temperature state to 1500 degrees. Thus, the core of a procedural text summarization model is to capture the salient entity and describe the trace of its state changes. To generate a better summary for procedural text, two challenges must be tackled: (1) modeling the relationship between procedures, so as to explore past and future entity states and capture comprehensive contextual information; and (2) identifying the salient entity in each procedure. However, plain text summarization methods (Zhang et al., 2020b; Zhu et al., 2021) only encode the text of a specific procedure and cannot model the relationship between contextual procedures. Much research on procedural text (Du et al., 2019a; Tandon et al., 2018) focuses on extracting the state changes of each entity involved in the process. Building on these entity state change analysis methods, we propose to employ the trace of entity state changes to explicitly model the relationship between procedures, which lets the model capture the salient procedure and tackle the procedural text summarization task.
In this paper, we propose a procedural text summarization framework named Entity-State Graph-based Summarizer (ESGS), which constructs a heterogeneous graph with procedure nodes and entity nodes. To construct the relationships between nodes, we follow an existing procedural text state tracking method (Tandon et al., 2020) based on GPT-2 (Radford et al., 2019). We then propose an entity state-aware message passing method on the graph to understand the procedural text from a comprehensive perspective. To identify the salient entity, we propose an entity selection module that uses the graph representation of the procedure. Finally, we employ a pre-trained language model that incorporates the graph and salient entity representations to generate the summary. To verify the effectiveness of ESGS, we first conduct experiments on the benchmark dataset WikiHow_proc, and we also propose a new procedural text summarization dataset, PsyStory, which summarizes procedural text in the global view. Extensive experiments on these two datasets demonstrate that ESGS brings substantial improvements over several strong baselines, including a state-of-the-art summarization method.
To sum up, our contributions are as follows: • We propose a procedural text summarization task that aims to generate summaries at two granularities: for each procedure and for the whole procedural text.
• We propose to leverage the entity state tracking method to construct a heterogeneous graph, and then generate a summary by incorporating the salient entity and graph representation.
• We propose a new procedural text summarization dataset PsyStory.
• Experiments conducted on two datasets show that our ESGS method outperforms all baselines, including the state-of-the-art summarization model.

State Tracking in Procedural Text
Procedural text is a domain of text concerned with understanding some kind of process, such as a phenomenon arising in nature or a set of instructions to perform a task. Entity tracking is at the core of understanding procedural text: the goal is to track the sequence of state changes (e.g., creation and movement) that entities undergo over long sequences of procedure steps. Past work models entities across procedure steps (Das et al., 2019; Tang et al., 2020; Kiddon et al., 2015; Gupta and Durrett, 2019b). Dalvi et al. (2019) propose to use WikiHow, an open-domain procedural text dataset, to train a state change tracking model with a limited set of states. Tandon et al. (2020) first propose a GPT-2 based entity state tracking method that can analyze open-domain procedural text with an unlimited state space.

Text Summarization
Abstractive summarization methods (Gehrmann et al., 2018a; Jin and Wan, 2020; Maynez et al., 2020; Liu and Liu, 2021) aim to generate a fluent, condensed short text that covers the main idea of the input document. Many researchers use sequence-to-sequence frameworks that first read the document and then generate a summary with a decoder (See et al., 2017; Lin et al., 2018; Celikyilmaz et al., 2018). With the development of pre-training techniques on large-scale plain text, the fluency of abstractive summarization has been significantly improved (Lewis et al., 2020; Zhang et al., 2020b). However, most existing summarization research concentrates on plain documents (Gao et al., 2020), and the procedural genre, which is usually long and contains many detailed facts, has not been fully explored in summarization research. Although Koupaee and Wang (2018) propose to use WikiHow as a summarization dataset, work on this dataset concatenates all the steps of a procedural text and treats it as a plain document summarization task.

Problem Formulation
Given a procedural text P = {s_1, ..., s_{L_p}} with L_p procedures, each procedure s_i = {w_{i,1}, ..., w_{i,L_s}} contains L_s words. In the step-view sub-task, our goal is to generate a summary Ŷ_i with L_y words for each procedure s_i. In the global-view sub-task, we aim to generate one summary for all the procedures in P. We use the difference between the generated summary and the ground truth as the training objective. In the following sections, we use the step-view as the example to illustrate our method; the model differences for the global-view are introduced in § 4.7.

Overview
In this section, we introduce the Entity-State Graph-based Summarizer (ESGS). Figure 2 shows an overview of ESGS, which has four main parts: • Procedural Graph Construction uses a GPT-2 based method to analyze the entity state changes in each procedure and uses the trace of entity states to construct the graph.
• Procedural Graph Encoding employs a state-aware message passing method to model the contextual information for each procedure.
• Entity Selection Module uses posterior information to predict the salient entity in a procedure.
• Summary Generation employs pre-trained BART with a graph attention layer that incorporates the selected entity and the contextual information of the procedure into summary generation.
There are two procedural text summarization sub-tasks: step-view and global-view summarization. In the following sections, we use the step-view task as the example to illustrate the details of the ESGS model. The step-view ESGS model can then be easily adapted to the global-view task with only two small modifications; we illustrate this variant model in § 4.7.

Preliminary
Heterogeneous Graph. In ESGS, we employ a heterogeneous graph to model the relationship between procedures and entities; a heterogeneous graph is an information network with multiple types of nodes and edges (Sun and Han, 2013; Wang et al., 2019). A heterogeneous graph, denoted G = (V, E), consists of a node set V and an edge set E, together with a node type mapping function φ: V → A and an edge type mapping function ψ: E → R, where A and R denote the sets of node and edge types. In Figure 3, we construct a heterogeneous graph with two types of nodes, procedure and entity, and three types of links.
Metapath (Sun et al., 2011). A metapath ρ is a path of the form A_1 →(R_1) A_2 →(R_2) ... →(R_{l-1}) A_l, which describes a composite relation R = R_1 ∘ R_2 ∘ ... ∘ R_{l-1} between node types A_1 and A_l. Each metapath may have multiple metapath instances. As shown in Figure 2, two procedures can be connected via two metapaths, e.g., "procedure-procedure" and "procedure-entity-procedure", where the metapath "procedure-entity-procedure" has 3 metapath instances, shown with yellow lines.
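Metapath instances can be made concrete with a small illustration. The following sketch (our own toy code with invented procedure and entity names, not part of ESGS) enumerates instances of the "procedure-entity-procedure" metapath by finding pairs of procedures that share an entity:

```python
# Toy illustration: enumerate instances of the "procedure-entity-procedure"
# metapath on a small heterogeneous graph. The graph and names are invented.

def pep_instances(proc_to_entities):
    """Yield (p_i, entity, p_j) triples where two distinct procedures
    share an entity, i.e. instances of the metapath P-E-P."""
    instances = []
    procs = sorted(proc_to_entities)
    for i, p in enumerate(procs):
        for q in procs[i + 1:]:
            shared = proc_to_entities[p] & proc_to_entities[q]
            for e in sorted(shared):
                instances.append((p, e, q))
    return instances

# Cooking-style example: s1 and s2 share "potato", s2 and s3 share "chicken".
graph = {
    "s1": {"potato"},
    "s2": {"potato", "chicken"},
    "s3": {"chicken", "butter"},
}
print(pep_instances(graph))
# [('s1', 'potato', 's2'), ('s2', 'chicken', 's3')]
```

Each returned triple is one metapath instance; the same pair of procedures can be connected by several instances if they share several entities.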

Procedural Graph Construction
We first leverage the open-domain entity state tracking method ProcGPT (Tandon et al., 2020), which is based on the pre-trained language model GPT-2 (Radford et al., 2019), to analyze the procedural text. We employ this method to extract the entities and states from a procedure step text as a set of tuples:

(entity e_i, before-state t_b^i, after-state t_a^i) = ProcGPT(s_i),    (1)

where s_i is the i-th procedure step text. We finetune ProcGPT (Tandon et al., 2020) on parallel data of procedure texts and entity tuples so that it generates the tuples for a procedure step text s_i. Figure 3 shows an example of graph construction: we first use the finetuned ProcGPT to extract entity-state tuples. Then we use the procedure step texts and entity words as the two types of graph nodes, and build edges that connect each entity and its states with the corresponding procedure step node. For brevity, we use only one entity and state per procedure step to illustrate the model in the following sections; in the real datasets, each step can have more than one entity.
After obtaining the entity and the corresponding states in each procedure, we use these relationships to build a graph for the procedural text. Intuitively, there are three types of edges in the graph: (1) edges between two adjacent procedures; (2) edges from a procedure node to an entity node; (3) edges from an entity node to a procedure node. We store the "after state" on type-(2) edges and the "before state" on type-(3) edges.
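The three edge types above can be sketched in a few lines. This is a minimal, hypothetical data-structure sketch (plain tuples instead of a real graph library; the step texts and state names are invented), not the authors' implementation:

```python
# Sketch of the heterogeneous graph construction: adjacent procedure nodes
# are linked, procedure->entity edges carry the "after" state, and
# entity->procedure edges carry the "before" state.

def build_graph(tuples_per_step):
    """tuples_per_step[i] is a list of (entity, before_state, after_state)
    tuples extracted from step i. Returns a list of typed edges
    (src_type, src, dst_type, dst, state)."""
    edges = []
    n = len(tuples_per_step)
    # (1) edges between adjacent procedure nodes
    for i in range(n - 1):
        edges.append(("proc", i, "proc", i + 1, None))
    for i, tuples in enumerate(tuples_per_step):
        for entity, before, after in tuples:
            # (2) procedure -> entity edge stores the "after" state
            edges.append(("proc", i, "ent", entity, after))
            # (3) entity -> procedure edge stores the "before" state
            edges.append(("ent", entity, "proc", i, before))
    return edges

steps = [
    [("potato", "whole", "sliced")],
    [("potato", "sliced", "seasoned")],
]
for edge in build_graph(steps):
    print(edge)
```

Here the shared entity "potato" connects steps 0 and 1, giving one instance of the "procedure-entity-procedure" metapath with the state trace whole → sliced → seasoned stored along its edges.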

Procedural Graph Encoding
First, we employ the pre-trained BART (Lewis et al., 2020) encoder to transform the procedure text into vector representations:

{h_{i,1}, ..., h_{i,L_s}} = Enc(s_i),    (2)

where Enc is the BART encoder, which outputs the vector representation h_{i,j} of the j-th input word w_{i,j} in the i-th procedure. To obtain a vector representation of each procedure, we extract the hidden state h_{i,0} of the special token [CLS] as the representation s_i = h_{i,0} of the i-th procedure. We use P = {s_1, ..., s_{L_p}} to denote the representations of all procedures. Similarly, we use the same encoder to map the entity words and their states into vector representations:

e_i = Enc(e_i),  t_a^{ij} = Enc(t_a^{ij}),  t_b^{ij} = Enc(t_b^{ij}),    (3)

where e_i is an entity word in the i-th procedure, and t_a^{ij}, t_b^{ij} denote the "after state" and "before state" between the i-th and j-th procedures.
To capture contextual information for a procedure step, inspired by the Heterogeneous graph Attention Network (HAN) (Wang et al., 2019), we propose an entity state-aware graph encoding method that integrates the states into the message passing between nodes. Before conducting message passing in the multi-layer graph, we use s_i, e_i, and t_b^{ij}, t_a^{ij} as the initial node representations for procedures, entities, and states, respectively. Similar to HAN, our method follows a hierarchical attention structure, from node level to semantic level. Node-level attention assigns different weights to neighbor nodes on a metapath. We extract two homogeneous sub-graphs G_1 and G_2 from the original heterogeneous graph G: G_1 and G_2 are the sub-graphs in which procedure nodes are connected by the metapaths "procedure-procedure" and "procedure-entity-procedure", respectively. In G_2, each edge carries the states of the shared entity as an edge attribute t_ij = t_a^{ij} || t_b^{ij}, where || denotes the concatenation operation.
First, the node representation s_i is mapped to a feature space by a transformation matrix M_l, l ∈ {1, 2}, for each sub-graph:

s_i^l = M_l · s_i.    (4)

Then the model learns a weight a_ij^l for each pair of neighbour nodes s_i and s_j in sub-graph G_l. In G_1, the procedure nodes are connected sequentially, and the weight is calculated from the node representations:

a_ij^1 = Attn(s_i^1, s_j^1),    (5)

where Attn denotes the multi-head (Vaswani et al., 2017) node-level attention operation. In G_2, since the procedure nodes are connected by a shared entity and the state trace of the entity is stored on the edge, we incorporate the edge attribute information when calculating the weights between nodes:

a_ij^2 = Attn(s_i^2, MLP(s_j^2 || t_ij)),    (6)

where MLP is a fully connected layer with an activation function.
After obtaining a_ij^l between procedures i and j, we normalize them with a softmax to get the attention weight α_ij^l:

α_ij^l = exp(a_ij^l) / Σ_{k∈N_i^l} exp(a_ik^l).    (7)

Then the updated representation z_i^l of node s_i is aggregated as the weighted sum over all neighbour nodes in sub-graph G_l:

z_i^l = σ( Σ_{j∈N_i^l} α_ij^l · s_j^l ),    (8)

where σ is an activation function and N_i^l denotes the neighbour nodes of i in sub-graph G_l. Next, we combine the procedure node representations {z_i^1, z_i^2} from the different metapaths into an overall representation z_i using semantic-level attention. We use the same semantic-level attention as the original HAN, and we refer readers to HAN (Wang et al., 2019) for details.
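The node-level normalization and aggregation described above amount to a softmax over neighbour scores followed by a weighted sum of neighbour vectors. A pure-Python toy sketch (omitting the activation σ and the multi-head machinery; the scores and vectors are made-up values):

```python
# Toy node-level aggregation: softmax-normalize raw attention scores over
# a node's neighbourhood, then take the weighted sum of neighbour vectors.
import math

def aggregate(scores, neighbour_vecs):
    """Return (softmax weights, weighted sum of neighbour vectors)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(neighbour_vecs[0])
    z = [sum(w * v[d] for w, v in zip(weights, neighbour_vecs))
         for d in range(dim)]
    return weights, z

# Two neighbours with equal scores contribute equally to the update.
weights, z = aggregate([1.0, 1.0], [[2.0, 0.0], [0.0, 2.0]])
print(weights)  # [0.5, 0.5]
print(z)        # [1.0, 1.0]
```

In the model this aggregation runs once per sub-graph (per metapath), and semantic-level attention then mixes the two resulting vectors per node.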

Entity Selection Module
To generate a concise summary that captures the main actions of the procedure, we should select the salient entity from the procedure text. When the model focuses on different entities, different summaries can be generated. In this paper, we propose an entity selection module that predicts the salient entity to help the summarization model focus on the main procedure actions.
When training the summarization model, if we predict the entity selection based only on the procedure (a.k.a. prior information) without knowing the ground truth summary (a.k.a. posterior information), it is difficult to generate a good summary, since the salient entity might not be selected accurately. Training the summarization model on trivially selected entities is sub-optimal, since they cannot provide helpful training signals (Lian et al., 2019). In contrast, if we use both the procedure text and the ground truth summary to predict a posterior distribution over entities, we obtain an effective training signal, since the ground truth summary contains the salient entities.
In this paper, we propose to select the salient entity using both prior and posterior information. We first encode all the entities into vectors e = {e_1, ..., e_{L_e}} using Equation 2, where e is the entity set of the i-th procedure with L_e entities; we omit the subscript i for brevity. For the prior entity distribution, we define a conditional probability distribution over all the entities e given the procedure text s_i, denoted p(e|s_i). Specifically, we incorporate two types of information to model the prior entity distribution p(e|s_i): (1) the graph representation z_i of procedure s_i, which carries the contextual information of the related procedures; and (2) the vector representation s_i encoded by the pre-trained language model.

p(e|s_i) = MHAttn(z_i ⊕ s_i, e),    (9)

where ⊕ denotes vector concatenation, and MHAttn denotes the multi-head attention mechanism (Vaswani et al., 2017) used to measure the relationship between each entity and the procedure s_i. Then we use the prior entity distribution to compute a weighted sum of the entity representations as the selected entity representation E_i^prior:

E_i^prior = Σ_{k=1}^{L_e} p(e_k|s_i) · e_k.    (10)

For the posterior entity distribution, we additionally condition on the ground truth summary Y_i, giving p(e|s_i, Y_i). We use the same BART encoder (Equation 2) to obtain the representation Y_i of the ground truth summary Y_i.

p(e|s_i, Y_i) = MHAttn(z_i ⊕ s_i ⊕ Y_i, e).    (11)

Similarly, we obtain the selected entity representation E_i^post from the posterior distribution, using the same method as Equation 10. Different from the prior information, the posterior information, predicted with access to the ground truth summary, yields a more accurate entity selection.
In the training phase, we use the entity representation selected by the posterior distribution in summary generation; at test time, we use the entity representation selected by the prior distribution, since the ground truth summary is not available. Intuitively, this discrepancy between prior and posterior leads to a mismatch between training and testing. Thus, we employ the KL divergence as a training objective to minimize the distance between the prior and posterior distributions:

L_i^KL = KL( p(e|s_i, Y_i) || p(e|s_i) ).    (12)

Inspired by Zhao et al. (2017), to ensure the accuracy of the selected entity, we enforce the relevance between the selected entity and the ground truth summary. Specifically, we apply a fully connected layer that takes the entity representation selected by the posterior distribution as input and predicts the bag-of-words (BOW) of the ground truth summary Y_i:

p_bow = softmax(W · E_i^post + b),    (13)

where W, b are trainable parameters, and the BOW loss L_i^BOW is the negative log-likelihood of the words of Y_i under p_bow.
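The KL term can be illustrated on toy distributions. The sketch below is illustrative only (in the model, the prior and posterior entity distributions are produced by attention over entity vectors; here they are just hand-picked numbers):

```python
# Toy KL divergence between a posterior entity distribution (sharper,
# informed by the gold summary) and a prior one (procedure text only).
# Minimizing this pulls the prior toward the posterior during training.
import math

def kl_div(posterior, prior):
    """KL(posterior || prior) for two discrete distributions."""
    return sum(p * math.log(p / q) for p, q in zip(posterior, prior) if p > 0)

posterior = [0.7, 0.2, 0.1]   # informed by the ground truth summary
prior     = [0.4, 0.3, 0.3]   # predicted from the procedure alone
print(round(kl_div(posterior, prior), 4))  # 0.2008
```

When the two distributions coincide the divergence is zero, so a perfectly trained prior would select the same entity at test time that the posterior selects during training.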

Summary Generation
We employ the pre-trained language model BART (Lewis et al., 2020) as the decoder to generate the summary. To incorporate the selected salient entity and the contextualized procedure representation from the graph model, we insert an additional attention layer into BART. We first apply self-attention to the masked output summary embeddings, which yields the self-attention output a_s. This process is the same as in the original Transformer (Vaswani et al., 2017), and we omit it due to limited space. In the original BART, the output a_s cross-attends to the word-level procedure hidden states {h_{i,1}, ..., h_{i,L_s}} produced by the BART-based procedure encoder (Equation 2). To aggregate salient information from both the updated graph node z_i and the selected entity representation, we concatenate these vectors with the word-level procedure hidden states in the cross-attention layer:

a_g = CrossAttn( a_s, [h_{i,1}, ..., h_{i,L_s}, z_i, E_i^*] ),    (15)

where E_i^* denotes the entity selected by the posterior distribution at training time and by the prior distribution at test time. Finally, we apply a fully connected feed-forward network on a_g to predict the distribution over the vocabulary for the generated summary. We use the cross-entropy loss L_i^ce between the generated summary Ŷ_i and the ground truth summary Y_i to optimize all the parameters of ESGS, and the final loss function L_i for the i-th procedure is defined as:

L_i = L_i^ce + L_i^KL + L_i^BOW.    (16)

Model Variant for Global-View Setting
In global-view procedural text summarization, we summarize all the procedures together instead of summarizing each procedure separately. The global-view model differs in two small ways. First, in the entity selection module, instead of selecting a salient entity for each procedure, we use all the procedure graph nodes to predict the salient entity for the whole procedural text. Second, we modify the cross attention in the decoder (Equation 15) to concatenate the representations of all procedure graph nodes {z_1, ..., z_{L_p}} instead of using only one procedure graph node.
Experimental Setup

Dataset
To validate the effectiveness of procedural text summarization methods, we propose two datasets: WikiHow_proc and PsyStory. Detailed statistics are shown in Table 1. We show the performance of several summarization baselines on these datasets in Table 2.
WikiHow_proc is a modified version of the WikiHow dataset (Koupaee and Wang, 2018), which contains articles describing procedural tasks on various topics (from arts and entertainment to computers and electronics) with multiple steps. Many existing procedural text state analysis methods (Tandon et al., 2020; Zhang et al., 2020c,a; Dalvi et al., 2019; Goyal et al., 2021) have conducted experiments on WikiHow, making it the benchmark dataset for procedural text modeling. Each article consists of multiple paragraphs, and each paragraph starts with a sentence summarizing it. As described in previous sections, we use ProcGPT (Tandon et al., 2020) to annotate the entities and state changes of the WikiHow dataset, and then remove low-quality data samples that have no or only a few entities.
PsyStory is based on the story dataset of Rashkin et al. (2018), which explains characters' naive psychology as fully specified chains of mental states for motivations and emotional reactions. This annotation is at the sentence level, and we treat each character as an entity and each mental state as an entity state. We used an outsourced human annotation service (we paid $1 per data sample) with independent annotation quality control to write the summaries for the procedural texts. Since the service vendor has independent quality control, we treat these summaries as high-quality ground truth.

Evaluation Metrics
We adopt the ROUGE score (Lin, 2004) and BLEU (Papineni et al., 2002), which are widely applied in summarization and text generation evaluation (Gao et al., 2019; Chen et al., 2018). The ROUGE metrics compare the generated summary with the reference summary by counting overlapping lexical units, including ROUGE-1/2 (n-gram overlap) and ROUGE-L (longest common subsequence).
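As a rough illustration of how ROUGE-N overlap works (the experiments use the standard ROUGE toolkit; this simplified version ignores stemming and the toolkit's bootstrap aggregation), an n-gram overlap F1 can be computed as:

```python
# Simplified ROUGE-N: F1 over overlapping n-grams between a reference
# summary and a candidate summary. Example sentences are invented.
from collections import Counter

def rouge_n_f1(reference, candidate, n=1):
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, cand = ngrams(reference.split()), ngrams(candidate.split())
    overlap = sum((ref & cand).values())   # clipped n-gram matches
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    return 2 * precision * recall / (precision + recall)

ref = "wipe across the carpet using long strokes"
cand = "wipe the carpet with long strokes"
print(round(rouge_n_f1(ref, cand), 3))  # 0.769
```

ROUGE-2 is the same computation with n=2, while ROUGE-L instead scores the longest common subsequence, rewarding in-order matches that need not be contiguous.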

Comparison Methods
To prove the effectiveness of each module, we conduct ablation studies that remove each key module of ESGS, forming three ablation methods: (1) ESGS-MsgPass uses the original Heterogeneous Graph Attention Network (HAN) as the graph encoder, removing the entity state-aware message passing module of ESGS. (2) ESGS-KLLoss removes the KL-divergence loss from the training objective and uses only the prior information in training and testing. (3) ESGS-BOWLoss removes the bag-of-words loss from the training objective.
Apart from the ablation study, we also compare with the following baselines: (1) TextRank (Mihalcea and Tarau, 2004) is a graph-based ranking model for extractive document summarization. (2) S2SA is the Sequence-to-Sequence framework (Sutskever et al., 2014) equipped with the attention mechanism (Bahdanau et al., 2015).
(3) PGNet (See et al., 2017) proposes the copy mechanism to directly copy out-of-vocabulary words from the input document into the summary. (4) BART (Lewis et al., 2020) and T5 (Raffel et al., 2020) are large-scale pre-trained language models, which we fine-tune on our procedural text summarization task. (5) PEGASUS (Zhang et al., 2020b) is a large-scale pre-trained Transformer model with a new self-supervised summarization objective, and it achieves state-of-the-art performance on many summarization benchmarks. (6) BottomUp (Gehrmann et al., 2018b) is an entity-driven summarization method. (7) BART+CTX is based on BART and concatenates the two surrounding procedures with the original input procedure as contextual information.

Implementation Details
The batch size is 16, with gradient accumulation to simulate a larger batch size. We pad or truncate each procedure to 200 words, and the maximum decoding length is 200. We initialize BART with BART-base, which has 16 attention heads, a hidden size of 768, and 6 Transformer layers.
Experimental Results

Overall Performance
In Table 2, we examine the performance of our model and the baseline methods on both datasets in terms of ROUGE score. ESGS achieves 39.93%, 60.51%, and 38.74% improvements over the state-of-the-art summarization method PEGASUS in terms of ROUGE-1, ROUGE-2, and ROUGE-L on the benchmark dataset WikiHow_proc. On the global-view procedural text summarization task, ESGS also outperforms the other baseline methods on all three metrics. It is worth noting that the baseline BART+CTX, which also incorporates the contextual information of a procedure, outperforms the other baselines, demonstrating the usefulness of contextual information. However, BART+CTX is still 3.33% worse than ESGS in terms of ROUGE-1, which indicates that the change of entity state is important for summarizing procedural text and that simple concatenation cannot fully exploit the relationship between procedures. The observation that ESGS outperforms BART on PsyStory also supports this, since BART likewise uses all the procedures for summarization.

Ablation Study
To verify the effectiveness of each module in ESGS, we evaluate the ablation models (described in § 5.3) on both datasets; the results are shown in Table 3. All ablation models perform worse than ESGS on all metrics on both datasets, which demonstrates the strength of ESGS. We find that ESGS-MsgPass performs worst among the ablation models, which confirms that entity state information helps the model identify whether a message is salient when passing among procedure nodes.

Effectiveness of Different Metapath
In our ESGS model, we employ a metapath-based heterogeneous graph that passes messages among nodes along two metapaths. In this section, we remove each metapath from the model in turn to verify the effectiveness of the metapaths. Table 4 shows the ROUGE scores for each ablation model. We find that the metapath "sentence-entity-sentence" contributes most to the summarization performance, since it connects non-adjacent procedures through their common entities and models the transition trace of the entity states, which is important for summarizing the salient entity and its state changes.

Table 5 (content):
Procedures
#1. Use a soft cloth and only dampen it rather than soak it, so that the carpet or rug is not made wet, only moistened. (cloth, dry, wet)
#2. Keep to the pile of the carpet and wipe in this direction at all times. (carpet, dirty, clean), (cloth, wet, hold), (hand, empty, holding cloth)
#3. Rub warm breadcrumbs through the surface of the carpet. This will bring out the colour again.
Ref. Wipe across the carpet or rug using even, long strokes
B.+C.
ESGS Wipe down the surface of the carpet with a damp cloth

Case Study
Table 5 shows a procedural text from the WikiHow_proc dataset and the corresponding summaries generated by different methods on the step-view task. Due to limited space, we show an example with short procedures. We observe that the BART-based baselines generate fluent summaries with incomplete facts. In contrast, ESGS produces a fluent summary that is consistent with the main step of the procedure, since the relationship between steps #1 and #2 is captured by the "procedure-entity-procedure" metapath.

Conclusion
In this paper, we propose the procedural text summarization task, which aims to generate summaries at two granularities (step-view and global-view). We propose the heterogeneous graph model Entity-State Graph-based Summarizer (ESGS), which constructs the relationships between procedures using the trace of entity state changes. To focus on salient entities, we also propose an entity selection module that is trained with posterior information to provide effective guidance. Finally, we generate the summary by incorporating the updated graph and the selected salient entities. We conduct experiments on a modified version of the benchmark dataset, WikiHow_proc, and we also construct a new procedural text summarization dataset, PsyStory, with global-view summaries. ESGS achieves state-of-the-art performance on both datasets.

Limitations
Since we use a large pre-trained language model as the backbone of our proposed method, it is hard to deploy on edge devices or mobile phones. We can employ model compression methods to accelerate inference on the server and provide services to users through the internet.

Figure 1 :
Figure 1: Example of procedural text summarization with the two sub-tasks for summaries of different granularity.

Figure 2 :
Figure 2: Overview of ESGS with four parts: (1) Procedural Graph Construction (refer to Figure 3 for details); (2) Procedural Graph Encoding; (3) Entity Selection Module; and (4) Summary Generation. We use the step-view summarization process for the 2nd procedure as an example.

Figure 3 :
Figure 3: Graph construction from procedural text. The graph contains two types of nodes: sentence and entity. The entity states are stored on the edges.

Table 2 :
Automatic metrics comparison between baselines on the two datasets.

Table 3 :
Comparison between ablation models.

Table 5 :
Examples of the summaries generated by ESGS and BART+CTX for procedure #2. Text in blue denotes contextual information from other procedures.