Reasoning over Entity-Action-Location Graph for Procedural Text Understanding

Procedural text understanding aims at tracking the states (e.g., create, move, destroy) and locations of the entities mentioned in a given paragraph. To effectively track the states and locations, it is essential to capture the rich semantic relations between entities, actions, and locations in the paragraph. Although recent works have achieved substantial progress, most of them focus on leveraging the inherent constraints or incorporating external knowledge for state prediction. The rich semantic relations in the given paragraph are largely overlooked. In this paper, we propose a novel approach (REAL) to procedural text understanding, where we build a general framework to systematically model the entity-entity, entity-action, and entity-location relations using a graph neural network. We further develop algorithms for graph construction, representation learning, and state and location tracking. We evaluate the proposed approach on two benchmark datasets, ProPara and Recipes. The experimental results show that our method outperforms strong baselines by a large margin, i.e., 5.0% on ProPara and 3.2% on Recipes, illustrating the utility of semantic relations and the effectiveness of the graph-based reasoning model.


Introduction
Procedural text often consists of a sequence of sentences describing processes, such as a phenomenon in nature (e.g., how sedimentary rock forms) or instructions to complete a task (e.g., the recipe for Mac and Cheese). Given a paragraph and its participant entities, the task of procedural text understanding is to track the states (e.g., create, move, destroy) and locations (a span in the text) of the entities. Compared with the traditional machine reading task, which mainly focuses on the static relations among entities, procedural text understanding is more challenging since it involves discovering complex temporal-spatial relations among various entities from the process dynamics.
To effectively track the states and locations of entities, it is crucial to systematically model rich relations among various concepts in the paragraph, including entities, actions, and locations. Three types of relations are of particular interest.
First, mentions of the same entity in different sentences are related. The inherent relation among these mentions may provide clues for a model to generate consistent predictions about the entity. For example, the entity electrical pulses is mentioned in two sentences: "The retina's rods and cones convert it to electrical pulses. The optic nerve carries electrical pulses through the optic canal." Connecting its two mentions helps to infer its location in the first sentence using the second sentence's information.
Second, detecting connections between an entity and the corresponding actions helps to make state predictions more accurate. Take the sentence "As the encased bones decay, minerals seep in replacing the organic material." as an example. The entity bone is related to decay which indicates the state destroy, while it is not connected to seep indicating the state move. Given the relation between bone and decay, it is easier for the model to predict the state of bone as destroy, instead of being misled by the action seep.
Last, when the state or location of one entity changes, it may impact all associated entities. For example, in the sentence "trashbags are thrown into trashcans.", trashbags are associated with trashcans. Then, in the following sentence "The trashcan is emptied by a large trash truck.", although trashbags are not explicitly mentioned, their locations are changed through the association with trashcan.
Recent works on procedural text understanding have achieved remarkable progress (Gupta and Durrett, 2019b; Du et al., 2019; Das et al., 2019; Gupta and Durrett, 2019a). However, the existing methods do not systematically model the relations among entities, actions, and locations. Instead, most methods either leverage inherent constraints on entity states or exploit external knowledge to make predictions. For example, Gupta and Durrett (2019b) propose a structural neural network to track each entity's hidden state and summarize the global state transitions with a CRF model. Other work injects commonsense knowledge into a neural model with soft and hard constraints. Although Das et al. (2019) model the relation between entities and locations, there is no general framework to model the relations, and some important relations, such as entity-action and entity-entity relations, are ignored.
A general framework to systematically model the rich types of relations among entities, actions, and locations is essential to procedural text understanding. To the best of our knowledge, we are the first to systematically explore comprehensive relation modeling, representation, and reasoning. Specifically, we first construct an entity-action-location graph from a given paragraph, where three types of concepts (i.e., entities, locations, and actions) are identified and extracted as nodes. We then detect critical connections among those concepts and represent them as edges. Finally, we adopt a graph attention network to conduct Reasoning over the Entity-Action-Location graph (REAL), which provides expressive representations for downstream state and location predictions.
We evaluate the proposed approach on two benchmark datasets for procedural text understanding, ProPara and Recipes. Our approach outperforms the state-of-the-art strong baselines by a large margin, i.e., 5.0% on ProPara and 3.2% on Recipes. The ablation study and analysis show that the graph-based reasoning approach generates better representations for entities, locations, and actions. Thus, it is highly valuable for both state and location tracking of entities.

Related Work
REAL is closely related to two lines of work, i.e., procedural text understanding and graph reasoning in language understanding.
Procedural Text Understanding. Compared with early-stage models (Henaff et al., 2017; Seo et al., 2017), recent progress on the procedural text understanding task is mainly made by ensuring the prediction's consistency or injecting external knowledge. Various approaches (Gupta and Durrett, 2019b; Amini et al., 2020) have been proposed to predict consistent state sequences. For example, NCET (Gupta and Durrett, 2019b) tracks the entity in a continuous space and leverages a conditional random field (CRF) to keep the prediction sequence consistent. Other models inject knowledge from external data sources to complement missing knowledge. ProStruct introduces commonsense constraints to refine the probability space, while KOALA (Zhang et al., 2020) leverages a BERT encoder pre-trained on related Wikipedia corpora and injects ConceptNet (Speer et al., 2017) knowledge. Besides, a few models (Das et al., 2019) are proposed to build graphs over the procedural text. For instance, KG-MRC (Das et al., 2019) constructs dynamic knowledge graphs between entities and locations. However, these methods cannot systematically capture the relations among entities, actions, and locations, and the entity-action and entity-entity relations are ignored.
Graph Reasoning in Language Understanding. Graph-based reasoning methods (Zeng et al., 2020; Zhong et al., 2020; Zheng and Kordjamshidi, 2020) are widely used in natural language understanding tasks to enhance performance. For example, Zeng et al. (2020) construct a double-graph design for the document-level relation extraction (RE) task, and Zhong et al. (2020) construct a graph over the retrieved evidence sentences for the fact-checking task. Compared with these works, the entity-action-location graph in our approach copes better with the procedural text understanding task since it precisely defines the concepts we are concerned with in the task and captures the rich and expressive relations among them.
Model

Task Definition. The procedural text understanding task is defined as follows. Given a paragraph P consisting of T sentences (S_1, S_2, ..., S_T), describing the process (e.g., photosynthesis, erosion) of a set of N pre-specified entities {e_1, e_2, ..., e_N}, we need to predict the state y^s_t and location y^l_t of each entity at each step t corresponding to sentence S_t (we use step and sentence interchangeably). Candidate states are pre-defined (e.g., y^s_t ∈ {not exist (O), exist (E), move (M), create (C), destroy (D)} in the ProPara dataset), and the location y^l_t is usually a text span in the paragraph. Gold annotations for the state and location at each step t are denoted y^s_t and y^l_t, respectively.

Figure 1 shows the overview of our approach, which consists of three main components: graph construction, graph-based representation learning, and a prediction module. The graph construction module extracts nodes and edges from the input procedural paragraph and constructs a graph. The graph reasoning module initializes node representations using contextual word representations and reasons over the built graph. Finally, the prediction module leverages the graph-based representations to predict the state and location. Figure 2 shows an example of the graph constructed for a paragraph describing how fossils form.

Graph Construction
Nodes Extraction. We first extract text spans as nodes from the given paragraph. The text spans in the extracted nodes should cover all essential concepts in the paragraph. Three types of concepts play an important role in the entity tracking task, i.e., actions, entity mentions, and location mentions. Therefore, we extract nodes for them and obtain all the nodes N = {N_a, N_e, N_l}, where N_a represents action nodes, N_e represents entity mention nodes, and N_l represents location mention nodes. We first tag all the verbs with an off-the-shelf part-of-speech (POS) tagger and construct the set of action nodes N_a, with each node associated with a single verb or a phrase consisting of two consecutive verbs. For the entity mentions, we extract the explicit (exact matching or matching after lemmatization) or implicit (pronoun) mentions of all the entities. Coreference resolution is used to find pronoun mentions during data pre-processing. Besides, we utilize the POS tagger to extract location mentions: each tagged noun or consecutive adjective + noun phrase is identified as a location mention.
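The node-extraction step can be sketched as follows. This is a minimal illustration, assuming the paragraph has already been tokenized and POS-tagged (the paper relies on an off-the-shelf tagger); the tag names and data layout are our own simplifications, not the paper's implementation.

```python
def extract_nodes(tagged_sents, entities):
    """Extract action, entity-mention, and location-mention nodes.
    tagged_sents: list of sentences, each a list of (token, tag) pairs.
    entities: set of tracked entity names (lowercase).
    Returns three lists of (sentence_index, text_span) nodes."""
    actions, entity_mentions, locations = [], [], []
    for t, sent in enumerate(tagged_sents):
        i = 0
        while i < len(sent):
            tok, tag = sent[i]
            if tag == "VERB":
                # merge two consecutive verbs into a single action phrase
                if i + 1 < len(sent) and sent[i + 1][1] == "VERB":
                    actions.append((t, tok + " " + sent[i + 1][0]))
                    i += 2
                    continue
                actions.append((t, tok))
            elif tag == "NOUN":
                # a noun, or adjective + noun phrase, is a location candidate
                if i > 0 and sent[i - 1][1] == "ADJ":
                    locations.append((t, sent[i - 1][0] + " " + tok))
                else:
                    locations.append((t, tok))
                # an exact match with a tracked entity is an entity mention
                if tok.lower() in entities:
                    entity_mentions.append((t, tok.lower()))
            i += 1
    return actions, entity_mentions, locations
```

A real pipeline would add lemmatization and coreference-based pronoun mentions on top of this skeleton.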
Edges Generation. Capturing the semantic relations between various nodes is critical for understanding the process dynamics in the procedural text. To this end, we first derive verb-centric semantic structures via semantic role labeling (SRL) (Shi and Lin, 2019) for each sentence and then establish intra- and inter-semantic-structure edges.
Given a verb-centric structure consisting of a central verb and corresponding arguments, we create two types of edges. (1) If an entity mention n e ∈ N e or location mention n l ∈ N l is a substring of an argument for verb n a ∈ N a , then we connect n e /n l to n a . For example, for the sentence "As the encased bones decay, minerals seep in replacing ...", the verb decay has an argument the encased bones where bones is an entity mention, then we will connect the action node decay and entity mention node bones.
(2) Two mentions in two arguments of the same verb are also connected. For example, for the sentence "The trashbags are thrown into a large outdoor trashcan", the verb thrown has two arguments, the trashbags and into a large outdoor trashcan, so we connect the two mention nodes trashbags and trashcan.
We also create edges between mentions of the same entity in different semantic structures. For example, in Figure 2, the entity bones are mentioned in two sentences, which correspond to two entity mention nodes. We connect these two nodes to propagate information from one to the other during graph-based reasoning.
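The three edge types above can be sketched as below. It assumes SRL output is available as (sentence index, verb, argument texts) triples and mentions carry a canonical entity name; these data structures are illustrative assumptions, not the paper's actual representation.

```python
def build_edges(srl_frames, mentions):
    """srl_frames: list of (sent_id, verb, [arg_text, ...]).
    mentions: list of (node_id, sent_id, text, entity), where `entity`
    is the canonical entity name or None for pure location mentions.
    Returns a set of undirected edges."""
    edges = set()
    for sent_id, verb, args in srl_frames:
        in_args = []
        for arg_i, arg in enumerate(args):
            for node_id, m_sent, text, _ in mentions:
                # (1) a mention that is a substring of an argument
                #     is connected to the central verb
                if m_sent == sent_id and text in arg:
                    edges.add((node_id, "verb:" + verb))
                    in_args.append((arg_i, node_id))
        # (2) mentions in two different arguments of the same verb
        for a_i, n_i in in_args:
            for a_j, n_j in in_args:
                if a_i < a_j:
                    edges.add(tuple(sorted((n_i, n_j))))
    # (3) mentions of the same entity across semantic structures
    by_entity = {}
    for node_id, _, _, entity in mentions:
        if entity is not None:
            by_entity.setdefault(entity, []).append(node_id)
    for nodes in by_entity.values():
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                edges.add(tuple(sorted((nodes[i], nodes[j]))))
    return edges
```

On the trashbag example from the text, this connects both mentions to thrown and adds the trashbags-trashcan edge.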

Graph-based Representation Learning
Nodes Representation. We first feed the entire paragraph to the BERT (Devlin et al., 2019) model, whose output is then sent into a Bidirectional LSTM (Hochreiter and Schmidhuber, 1997) (BiLSTM) to obtain the contextual embedding for each token. Each node in our graph is associated with a text span in the paragraph; therefore, the initial node representation is derived by mean pooling over all token embeddings in the corresponding text span. The contextual representation of node n_i is denoted h_i.

Figure 2: An example of the entity-action-location graph, constructed for the paragraph "...Soft tissues quickly decompose leaving the hard bones or shells behind. As the encased bones decay, minerals seep in replacing the organic material...". Action nodes and entity/location mention nodes are connected by intra-structure and inter-structure edges within and across verb-centric structures.

Graph Reasoning. We leverage a graph attention network (GAT) (Velickovic et al., 2018) for reasoning over the built graph. The network performs masked attention over neighbor nodes (i.e., nodes connected with an edge) instead of all the nodes in the graph. We apply a two-layer GAT, which means each node can aggregate information from its two-hop neighbors (nodes that can be reached within two edges).
In each GAT layer, we first extract the set of neighbor nodes N_i for each node n_i. The attention coefficient between node n_i and its neighbor n_j is computed through a shared attention mechanism,

e_ij = a^T [W h_i ∥ W h_j],

where a ∈ R^{2d} and W ∈ R^{d×d} are learnable parameters and ∥ is the concatenation operation. We apply a LeakyReLU activation function and normalize the attention coefficients:

α_ij = exp(LeakyReLU(e_ij)) / Σ_{m∈N_i} exp(LeakyReLU(e_im)).

Then, we aggregate the information from the neighbor nodes with multi-head attention to enhance stability and efficiency. The aggregated feature for n_i with K-head attention is

h'_i = ∥_{k=1}^{K} σ(Σ_{j∈N_i} α^k_ij W^k h_j)

in the first layer, and

h''_i = ∥_{k=1}^{K} σ(Σ_{j∈N_i} α'^k_ij W'^k h'_j)

in the second layer, where ∥ is the concatenation operation, σ is the sigmoid activation function, W^k ∈ R^{d×d} is the learnable matrix for the k-th head in the first layer, and W'^k ∈ R^{Kd×d} is the learnable matrix for the k-th head in the second layer. α^k_ij and α'^k_ij are calculated with the corresponding W^k and W'^k, respectively.
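The attention computation above can be sketched in NumPy for a single head. This is a simplified illustration, not the paper's implementation: it omits the K-head concatenation and uses a dense adjacency matrix.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(H, adj, W, a):
    """One single-head GAT layer.
    H: (n, d) node features; adj: (n, n) 0/1 adjacency with self-loops;
    W: (d, d_out) projection; a: (2 * d_out,) attention vector."""
    Z = H @ W                                    # project node features
    d_out = Z.shape[1]
    # e_ij = LeakyReLU(a^T [W h_i || W h_j]) decomposes into a source
    # term a[:d_out] . z_i plus a target term a[d_out:] . z_j
    src = Z @ a[:d_out]
    dst = Z @ a[d_out:]
    e = leaky_relu(src[:, None] + dst[None, :])  # (n, n) logits
    e = np.where(adj > 0, e, -1e9)               # mask: attend to neighbors only
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return 1.0 / (1.0 + np.exp(-(alpha @ Z)))    # sigmoid(sum_j alpha_ij W h_j)
```

Stacking two such layers (as in the paper) lets information propagate across two-hop neighborhoods.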

Prediction Model
Inspired by NCET (Gupta and Durrett, 2019b), we track the state and location separately, with a state tracking module and a location prediction module. Each module takes the representations of the concerned nodes as input and outputs the prediction (i.e., the state or location of an entity) at each time step.

State Tracking. Given a paragraph P and an entity e, the state tracking module tracks the state of the entity for each sentence. We first generate the representations of all sentences for the entity.
Considering that actions are good state-changing signals, we concatenate the embeddings of the entity mention node and the action node in the sentence as the representation at step t. That is,

x^e_t = [h^e_t ∥ h^v_t],

where x^e_t denotes the representation of entity e in sentence S_t, h^e_t denotes the representation of the entity mention node n_e in sentence S_t, and h^v_t denotes the representation of the action node n_a connected with n_e in sentence S_t. If entity e is not mentioned in sentence S_t, we use a zero vector as the representation of S_t for e. Note that if there are multiple mention nodes for entity e in sentence S_t, we take the mean pooling over all mention nodes as h^e_t, and we take a similar approach for multiple actions.
We utilize a BiLSTM layer on the sequence of sentence embeddings, and a conditional random field (CRF) (Durrett and Klein, 2015) is applied on top of the BiLSTM to make the final prediction. The loss function for the state tracking module is defined as

L_st = −Σ_{(P,e)∈D} Σ_{t=1}^{T} log P(y^s_t | P, e; θ_G, θ_st),

where D is the training collection containing entity-paragraph pairs, P(y^s_t | P, e; θ_G, θ_st) represents the predicted probability of the gold state y^s_t in sentence S_t given the entity e and paragraph P, θ_G are the parameters of the graph reasoning module and the text encoder, and θ_st are the parameters of the state tracking module.

Location Prediction. For the location prediction module, we first collect all the location mention nodes as the location candidate set C. We add an isolated location node to represent the special location candidate '?', which means the location cannot be found in the paragraph. The representation of this node is randomly initialized and learnable during training. Given an entity e and a location candidate l ∈ C, we represent the sentence S_t as

x^l_t = [h^e_t ∥ h^l_t],

where h^e_t and h^l_t denote the representations of the entity mention node and the location mention node in sentence S_t. If the entity or the location candidate is not mentioned in sentence S_t, we use a zero vector in place of h^e_t or h^l_t. We use a BiLSTM followed by a linear layer as the location predictor. The model outputs a score for each candidate at each step t. Then, we apply a softmax layer over all the location candidates' scores at the same step, resulting in a normalized probability distribution. The location loss is defined as

L_loc = −Σ_{(P,e)∈D} Σ_{t=1}^{T} log P(y^l_t | P, e; θ_G, θ_loc),

where P(y^l_t | P, e; θ_G, θ_loc) represents the predicted probability of the gold location y^l_t for entity e in sentence S_t, and θ_loc are the parameters of the location prediction module.
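Assembling the per-step input x^e_t for the state tracker, including the mean pooling over multiple mention/action nodes and the zero-vector fallback for unmentioned entities, can be sketched as follows (an illustrative sketch with hypothetical data structures, not the paper's code):

```python
import numpy as np

def step_features(node_vecs, mention_ids, action_ids, T, d):
    """node_vecs: {node_id: (d,) embedding}.
    mention_ids / action_ids: {step: [node_id, ...]} for one entity.
    Returns (T, 2d) features x^e_t = [h^e_t || h^v_t], with a zero row
    whenever the entity is not mentioned at step t."""
    feats = np.zeros((T, 2 * d))
    for t in range(T):
        ments = mention_ids.get(t, [])
        if not ments:
            continue  # entity absent from sentence t: keep the zero vector
        # mean pooling over multiple mention nodes
        feats[t, :d] = np.mean([node_vecs[m] for m in ments], axis=0)
        acts = action_ids.get(t, [])
        if acts:
            # mean pooling over multiple connected action nodes
            feats[t, d:] = np.mean([node_vecs[a] for a in acts], axis=0)
    return feats
```

The resulting sequence would then be fed to the BiLSTM(+CRF) described above; the location module builds x^l_t analogously from entity and location-candidate nodes.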

Learning and Inference
We create a single graph for each paragraph, which stays unchanged once created. The graph reasoning module and the state/location prediction modules are then jointly trained in an end-to-end manner. The overall loss is defined as

L = L_st + λ_loc L_loc,

where λ_loc is a hyper-parameter balancing the state tracking and location prediction losses. We perform inference in a pipeline mode. Specifically, for each entity, we first apply the state tracking module to infer its state at each time step. Then we predict its location only at steps where its state changes (i.e., the predicted state is create or move), and its locations at steps with unchanged states are inferred from its locations at previous steps. Such a pipeline increases the consistency between the states and locations of an entity compared with inferring them simultaneously.
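The pipeline inference described above can be sketched as: predict the state sequence first, then query the location module only at create/move steps and carry locations forward otherwise. The helper names and state/location codes are illustrative assumptions.

```python
def pipeline_infer(states, predict_location):
    """states: per-step state predictions, e.g. ["C", "E", "M", "D"]
    (create, exist, move, destroy; "O" = not existing).
    predict_location: callable t -> location span predicted at step t.
    Returns one location per step; '-' marks a non-existing entity."""
    locations = []
    prev = "-"
    for t, state in enumerate(states):
        if state in ("C", "M"):       # location changes: run the predictor
            prev = predict_location(t)
        elif state in ("D", "O"):     # destroyed or not yet created
            prev = "-"
        # state "E": location carried over unchanged from the previous step
        locations.append(prev)
    return locations
```

Because locations are filled in only where the state sequence licenses a change, the two predictions cannot contradict each other.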

Experiments
This section describes the evaluation results of REAL on two datasets, ProPara and Recipes. We also provide an ablation study and case analyses to illustrate the effectiveness of graph-based reasoning. ProPara contains procedural texts about scientific processes, e.g., photosynthesis and fossil formation. It contains about 1.9k instances (one entity-paragraph pair as an instance) written and annotated by human crowd workers. We follow the official train/dev/test split. The Recipes dataset consists of paragraphs describing cooking procedures, with their ingredients as entities. We only use the human-labeled data in our experiments, with 80%/10%/10% of the data for train/dev/test, respectively. Detailed statistics for the two datasets can be found in Table 1.

Datasets and Evaluation Metrics
We follow the settings of previous work and evaluate the proposed approach on two types of tasks on the ProPara dataset, a document-level task and a sentence-level task. The document-level task focuses on figuring out input entities, output entities, entity conversions, and entity movements by answering corresponding questions. More details can be found in the official script. Following the official script, we evaluate models with averaged precision, recall, and F1 scores. In the sentence-level task, we need to answer three categories of questions: (Cat-1) Is entity e created (destroyed, moved) in the process? (Cat-2) When is e created (destroyed, moved)? (Cat-3) Where is e created (destroyed, moved from/to)? For this task, we take the macro-average and micro-average of the scores for the three sets of questions as evaluation metrics.
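The difference between the macro and micro averages over the three question categories can be illustrated as follows. This is a generic sketch over hypothetical per-category counts, not the official evaluation script.

```python
def macro_micro(per_category):
    """per_category: list of (num_correct, num_total) per question category.
    Macro averages the per-category accuracies (each category weighs
    equally); micro pools all questions (each question weighs equally)."""
    macro = sum(c / t for c, t in per_category) / len(per_category)
    micro = sum(c for c, _ in per_category) / sum(t for _, t in per_category)
    return macro, micro
```

When the categories have unequal sizes, the two averages diverge: a large, easy category pulls the micro score up without affecting the macro score as much.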
For the Recipes dataset, we take the same setting as Zhang et al. (2020), where the goal is to predict the ingredients' location changes during the process. We use precision, recall, and F1 scores to evaluate models.

Implementation Details
We use BERT-base (Devlin et al., 2019) as the encoder and reason with a 3-head GAT. The batch size is set to 16, and the embedding size is set to 256. The learning rate r, location loss coefficient λ_loc, and dropout rate d are derived by grid search in 9 trials over r ∈ {2.5 × 10^−5, 3 × 10^−5, 3.5 × 10^−5}, λ_loc ∈ {0.2, 0.3, 0.4}, and d ∈ {0.3, 0.4, 0.5}. The implementation is based on Python and trained on a Tesla P40 GPU with the Adam optimizer for approximately one hour (with approximately 112M parameters). We choose the best model by prediction accuracy on the development set.

Table 2 compares REAL with previous work on the ProPara dataset for both the document-level and sentence-level tasks. Our proposed approach consistently outperforms, on all metrics, all previous models that do not utilize external knowledge. In particular, compared to DYNAPRO, it increases the document-level F1 score by 5.3% and the sentence-level macro-averaged accuracy from 55.4% to 58.2%. Without any external data, our approach achieves results comparable to KOALA, which extensively leverages rich external knowledge from ConceptNet and Wikipedia pages, demonstrating the effectiveness of exploiting the entity-action-location graph. We also compare REAL with a re-implemented NCET on the Recipes dataset. As shown in Table 3, REAL surpasses this strong baseline by 3.2%. All these results verify the effectiveness of the proposed graph-based reasoning approach.

Ablations
We conduct an ablation study to verify the effectiveness of the main components of our approach on ProPara and Recipes. As shown in Table 4, removing the graph-based representation learning for location/state prediction decreases the F1 score by 3.1%/3.6%, and the gap grows to 4.4% without any graph-based reasoning. We observe similar results on the Recipes dataset, indicating that exploiting the paragraph's rich relations is critical for both state tracking and location prediction.

Analyses of Different Relations
To further illustrate the effectiveness of the different types of relations, we conduct the analyses below and present three cases, with predictions of REAL with and without graph reasoning, in Figure 5.

First, to verify the effectiveness of action-entity relations in multi-verb sentences, we compare REAL with and without graph reasoning on sentences containing multiple (i.e., more than 2) verbs in Table 5. We find that graph-based reasoning increases the performance by 5.7%, indicating that accurately connecting entities and their corresponding actions improves prediction accuracy. For case 1 shown in Figure 5, the relation between the entity bone and the action decay helps the model correctly predict the state of bone as destroy, since the action decay indicates destroy. Without such an accurate connection between bone and decay, however, the prediction model is very likely to be misled by other actions such as seep or replace.

Second, we illustrate the impact of entity-entity relations by comparing our approach and the baseline on cases where the entity is not explicitly mentioned. As shown in Table 5, REAL increases the accuracy by 4.8%, which indicates the effectiveness of modeling cross-entity relations. The second case in Figure 5 illustrates the effectiveness of using entity-entity relations. The entity bags is not explicitly mentioned in the sentence "Trashcan gets emptied into trash truck", and thus the baseline model cannot correctly predict its state and location. However, connecting it to the entity trashcan, which is derived in the first sentence, helps the model infer its state and location correctly.

Figure 5: Examples of model predictions of our approach w/ (black) and w/o (red) graph reasoning. The corresponding sub-graph is plotted on the right of each paragraph. Dotted rectangles in the sub-graphs highlight key connections for correct prediction in graph-based reasoning.

Third, as discussed in the introduction, mention-mention connections might improve accuracy when there are multiple mentions of the same entity. The third case in Figure 5 shows how REAL utilizes relations between different mentions of the same entity. In the first sentence, the location of the entity small image is not mentioned, which results in a wrong location prediction when no graph reasoning is used. In contrast, the built graph connects this mention with the pronoun it in the second sentence, where its location is revealed as retina. Therefore, our model correctly predicts small image's location through graph-based representation learning.

Error Analyses
We randomly sample 100 wrongly predicted examples and summarize them into the following types.
First, the ambiguity between similar entities makes it difficult to derive accurate representations for them. For instance, fixed nitrogen and gas-based nitrogen are two different entities related to nitrogen in the paragraph "Nitrogen exists naturally in the atmosphere. Bacteria in soil fix the nitrogen. Nitrogen is now usable by living things.". It is difficult for a model to distinguish which entity the mention nitrogen refers to.
Second, commonsense knowledge is required. For example, it is difficult to infer the location of the entity bone in the sentence "An animal dies. It is buried in a watery environment." without the knowledge that "bone is part of animal". Therefore, injecting appropriate external knowledge while avoiding noise may improve the model.

Third, similar actions indicate different states in different contexts. For instance, in the sentence "the tree eventually dies.", the state of tree is labeled as destroy, while in the sentence "most fossils formed when animals or plants die in wet environment.", the states of animals and plants are annotated as exist, which may confuse the model.