Learning Latent Structures for Cross Action Phrase Relations in Wet Lab Protocols

Wet laboratory protocols (WLPs) are critical for conveying reproducible procedures in biological research. They are composed of instructions written in natural language describing the step-wise processing of materials by specific actions. This process flow for reagent and material synthesis in WLPs can be captured by material state transfer graphs (MSTGs), which encode global temporal and causal relationships between actions. Here, we propose methods to automatically generate an MSTG for a given protocol by extracting all action relationships across multiple sentences. We also note that previous corpora and methods focused primarily on local intra-sentence relationships between actions and entities and did not address two critical issues: (i) resolution of implicit arguments and (ii) establishment of long-range dependencies across sentences. We propose a new model that incrementally learns latent structures and is better suited to resolving inter-sentence relations and implicit arguments. This model draws upon a new corpus, WLP-MSTG, which was created by extending the annotations in the WLP corpus for inter-sentence relations and implicit arguments. Our model achieves an F1 score of 54.53% for temporal and causal relations in protocols from our corpus, a significant improvement over previous models (DyGIE++: 28.17%; spERT: 27.81%). We make our annotated WLP-MSTG corpus available to the research community.


Introduction
Wet laboratory protocols (WLPs) play an integral role in bioscience and biomedical research by serving as a vehicle to communicate experimental instructions that allow for standardization and replication of experiments. These procedures, typically written in natural language, prescribe actions (Figure 1) to be conducted on materials that generally produce new materials which, in turn, are used by future actions to make newer materials.¹ However, WLPs can be unclear, composed of disconnected and distant parts, and built upon implicit information that was referenced earlier or omitted entirely. Lack of careful documentation has led to a reproducibility crisis (Baker, 2016) in the biosciences and also poses considerable challenges for the automation of laboratory procedures: gleaning the effect and semantics of actions requires understanding the underlying experiment, the sentence structure, and the rationale behind implicitly stated arguments.

¹ The dataset and code are available on the authors' websites.

Figure 1: Extraction of an MSTG from an example WLP. The MSTG is composed of Action Graphs (in grey), connected by temporal and causal relationships (e.g., temporal relation "Site" between Remove and Add). The arrows indicate the direction of material flow.
Currently, there is a dearth of annotated resources for natural language instructions in laboratory protocols. The WLP corpus, initially collected by Kulkarni et al. (2018) and later updated by Tabassum et al. (2020), focused solely on relations within sentences. However, actions in WLPs are more complex, containing additional relations between actions (e.g., temporal and causal relations). We propose using material state transfer graphs (MSTGs), which are a natural extension of Action Graphs (Kulkarni et al., 2018). MSTGs link together several Action Graphs into a larger structure by utilizing global temporal and causal relationships that can span several sentences in order to describe the flow of materials from action to action (Section 3). An example of an MSTG is shown in Figure 1. The action phrase Grow the bacteria overnight in Step 1 consists of an action Grow that Acts-on the reagent bacteria for an amount of time specified as overnight. This Action Graph is then connected to other such graphs (like the one in Step 5) through temporal and causal relationships (e.g., the Grow action's product is host culture, so we use a Product link to establish a temporal relation between Step 1 and Step 5).
To automate the generation of MSTGs, we must overcome two distinct challenges prevalent in WLPs. First, the result of a preceding step may not be immediately used by the next step, resulting in long-range dependencies. Second, an action may involve implicit information, which is either mentioned earlier or omitted entirely. Current models usually fail to make accurate predictions for long-range relations, as seen in Figure 1 when establishing a temporal relation between Step 1 and Step 5. These methods rely on relation propagation (DyGIE++; Wadden et al., 2019) or use contextual embeddings (spERT; Eberts and Ulges, 2019). Furthermore, neither successfully establishes complex relations involving implicit arguments. In Step 5, the host culture and viral concentrate must be added to the tube containing soft agar that was removed in Step 4. However, the location tube in Step 5 is implicit and has to be correctly inferred to make the Site relation between Remove and Add.
We propose a novel and effective neural network model that (i) uses a series of relational convolutions to learn from relations within and across multiple action phrases, and (ii) iteratively enriches entity representations with learned latent structures using a multi-head R-GCN model. Our model achieves an F1 score of 54.5% for temporal and causal relations, significantly improving upon the previous methods DyGIE++ and spERT for such long-range relations by 26.4% and 26.7% respectively. We analyze our model for intra- and inter-sentence relation extraction and show substantial improvements. Further, we also show the model's ability to resolve implicit arguments, improving temporal relation extraction over the best baseline method by 23.3%. This paper is organized around two main contributions: (i) the WLP-MSTG corpus, which extends the WLP corpus (Kulkarni et al., 2018) by including intra- and cross-sentence temporal and causal relationships, and (ii) a novel model that builds upon latent structures to resolve implicit arguments and long-range relations spanning multiple sentences. In Section 2, we describe related work; in Section 3, we introduce MSTGs, highlighting the two challenges. Next, we describe our proposed model in Section 4 and demonstrate its performance in Section 5.

Related Work
Temporal and Causal Relation Extraction: Prior efforts have shown great promise in learning local and global features (Leeuwenberg and Moens, 2017; Ning et al., 2017). Neural-network-based methods have proven effective (Meng et al., 2017; Meng and Rumshisky, 2018). Notably, Han et al. (2019) use a neural support vector machine, which can be difficult to train. Early methods for extracting causal relations resorted to feature engineering (Bethard and Martin, 2008; Yang and Mao, 2014). Recently, several researchers (Zeng et al., 2014; Nguyen and Grishman, 2015; Santos et al., 2015) used convolutional neural networks (CNNs) for extracting causal features. Notably, Li and Mao (2019) addressed the scarcity of training data through a knowledge-based CNN. However, such methods are not scalable to multiple sentences.
Cross Sentence Relation Extraction: Long-range relations are understudied in the literature. Prior work focused on relations within a sentence or, at best, between pairs of sentences (Peng et al., 2017; Lee et al., 2018; Song et al., 2018; Guo et al., 2019). In addition to joint entity and relation extraction models, Wadden et al. (2019) proposed a model that passes useful information across graphs over cross-sentence contexts, while Eberts and Ulges (2019) encoded per-sentence contextual information for relation extraction over longer sentences.
Implicit Arguments: Early methods selected specific features to build linear classifiers (Gerber and Chai, 2010, 2012). Others incorporated additional, manually-constructed resources like named entity taggers and WordNet (Gerber and Chai, 2012; Laparra and Rigau, 2013; Fellbaum, 2012). In contrast, a few notable studies used unlabeled training data to resolve implicit arguments (Chiarcos and Schenk, 2015; Schenk et al., 2016). Finally, Do et al. (2017) explored the full probability space of semantic arguments; however, the method does not scale well.

Task Formulation: Material State Transfer Graph

To construct an MSTG from an input protocol, we define the following four concepts. (i) Action Graphs: Introduced by Kulkarni et al. (2018), they are extracted from action phrases as seen in Figure 1. Forming the fundamental unit of an MSTG, Action Graphs are composed of an Action, 17 types of named entities as explicit arguments (e.g., "Reagent", "Location", etc.), and 13 local semantic relations (e.g., "Using", "Measure", "Acts-on", etc.) represented as directed edges, which we shall refer to as inter-Action Phrase (iAP) relations hereafter. (ii) Temporal Relations: Inspired by prior work (Allen, 1984), we define temporality as a relationship between two action phrases such that an action's product (output) is connected to another action's source (input), thereby imposing a partial or total order. It is also necessary to determine whether an action is executed before or simultaneously with other actions. We use five temporal relations (namely "Acts-on", "Site", "Coreference", "Product", and "Overlaps") to capture the flow of materials. (iii) Causal Relations: Following Barbey and Wolff (2007), we define causality as the relationship between two actions where one action directly affects the execution of another action (e.g., if a given action enables or prevents another action). (iv) Implicit Arguments: We characterize implicit arguments into four cases (Figure 2a) depending on whether the source or product of the connected actions is implicit or explicit. Four of the five temporal relations in WLP-MSTG are defined to handle implicit arguments: "Acts-on", "Site", "Coreference", and "Product".
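The four concepts above can be sketched as a small in-memory data structure. The class and field names below are hypothetical (not from the released code) and serve only to make the graph shape concrete: Action Graphs form the nodes, and cAP-TaC relations form the cross-phrase edges.

```python
from dataclasses import dataclass, field

# Relation inventories from Section 3: five temporal types plus the
# causal "Enables"/"Prevents" links (names as used in the paper).
TEMPORAL = {"Acts-on", "Site", "Coreference", "Product", "Overlaps"}
CAUSAL = {"Enables", "Prevents"}

@dataclass
class ActionGraph:
    action: str                                    # trigger, e.g. "Grow"
    step: int                                      # step index in the protocol
    arguments: dict = field(default_factory=dict)  # iAP relation -> entity span

@dataclass
class MSTG:
    actions: list = field(default_factory=list)
    edges: list = field(default_factory=list)      # (src_idx, dst_idx, relation)

    def add_edge(self, src, dst, relation):
        assert relation in TEMPORAL | CAUSAL, f"unknown relation {relation}"
        self.edges.append((src, dst, relation))

# Figure 1 fragment: Step 1's product feeds Step 5 via a "Product" relation.
g = MSTG()
g.actions.append(ActionGraph("Grow", 1, {"Acts-on": "bacteria", "Time": "overnight"}))
g.actions.append(ActionGraph("Add", 5, {"Acts-on": "host culture"}))
g.add_edge(0, 1, "Product")
```

The key design point is that temporal and causal edges live between Action Graphs, not inside them, mirroring the paper's separation of iAP and cAP-TaC relations.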

Corpus for Cross Sentence Relations
Annotation Process: We annotate six hundred and fifteen (615) protocols derived from the WLP corpus to include the 6 global cross-Action Phrase Temporal and Causal (cAP-TaC) relationships. We split the annotation task into two phases. In the first phase, we worked with 7 expert annotators to develop the guidelines over 8 iterations. Each iteration consisted of 10 protocols that were individually annotated by each expert annotator, and the inter-annotator agreement (IAA) was measured for each of the 10 protocols. At the end of each iteration, we refined the set of rules to reduce the guidelines' ambiguity. The agreement measured across all annotators using Krippendorff's alpha (Krippendorff, 2004) on the last iteration was 78.23%.
With a good IAA attained, we began the second phase to collect the train, dev, and test datasets. To ensure the highest quality of the test data, we employed all 7 annotators to work on the same 128 protocols and merged the resulting annotations based on majority voting. In contrast, individual annotators collected the train and dev sets separately to speed up the annotation process. A typical protocol of 30 steps required 25 minutes on average for an annotator to identify all the cAP-TaC relations.
Comparison with previous corpora: Our corpus, WLP-MSTG, extends the WLP corpus (Kulkarni et al., 2018), which was later updated for a WNUT 2020 shared task (Tabassum et al., 2020). WNUT 2020 was primarily designed to facilitate supervised named entity taggers and within-sentence relation extraction methods. We extend the 615 protocols therein to include intra- and inter-sentence temporal and causal relations. To ensure a fully connected graph, we exclude entities and relations annotated for spurious descriptive sentences that do not prescribe any actions (e.g., titles, notations, definitions, etc.). Table 2 provides a comparison of statistics among the three corpora.
Analysis: We conducted a distribution analysis of 90 protocols that would typically serve as the dev set for machine learning models. Actions connected by temporal and causal relations tend to be consecutive (78.4%); however, a non-trivial number are considerably spaced apart (21.6%), with 1.08% of the total at least 8 actions apart. For implicit arguments, we observed: (i) implicit arguments are unusually prevalent in WLPs (88.44%), (ii) a higher percentage (55.98%) of the products of an action are implied, and (iii) temporally connected actions are closer if they contain implicit arguments; otherwise, they are relatively farther apart (Figure 2b). This analysis provides valuable insight into the challenges, in the form of long-range relations and implicit arguments, that are present in extracting MSTGs from WLPs.

A Latent Structure Model for Joint Entity and Relation Extraction
We develop a latent structure model for jointly learning entities and relations within and across multiple sentences. A schematic of the model is shown in Figure 3. In Section 4.1, we describe the construction of span representations (Figure 3A) from protocol text that incorporate critical features necessary for long-range relation extraction. Section 4.2 explains how the transcoder block (Figure 3B) builds upon latent structures (as illustrated in Figure 3D) to improve entity and relation representations. Finally, in Section 4.3, we discuss training and regularization strategies to jointly learn spans, entities, and relations through a multi-task loss function derived from span, entity, and relation scores (Figure 3C). We shall use Figure 1 as a running example throughout the model description.

Span Representation
Following prior span-based approaches (Wadden et al., 2019; Eberts and Ulges, 2019), our goal is to (i) collect a series of tokens from the protocol text, (ii) enumerate all spans, and (iii) rank top-scoring spans for consideration as candidates for entity and relation extraction.
Token embeddings: We use SciBERT (Beltagy et al., 2019) for learning token representations for a given protocol P. As shown in Figure 3, the input is a protocol P represented as a collection of sentences S = {s_1, ..., s_P}. Each sentence s_i is composed of a sequence of tokens {t_1, ..., t_n}. For example, within the sentence Add 1.0 mL host culture and either 1.0 or 0.1 mL viral concentrate (Figure 1, Step 5), we identify host, culture, etc., as the tokens to be passed to the SciBERT model. We batch process sentences in the protocol to generate context-aware embeddings {t_1, ..., t_n} for each sentence.
Span Enumeration: The span between two tokens t_i and t_j is represented as s_ij. We enumerate all possible spans of up to 10 tokens in size. For each enumerated span, the span representation e_ij ∈ R^{d_e} is derived by applying a feed-forward neural network (FFNN) to a concatenation of token representations and embeddings:

e_ij = FFNN_e([t_i; t_j; φ_sh(s_ij); φ_w(s_ij); φ_pos(s_ij); φ_step(s_ij)])    (1)

where t_i and t_j are the first and last token representations, φ_sh(s_ij) is a soft head representation (Bahdanau et al., 2014), and φ_w(s_ij) is a learnt span width embedding. Further, φ_pos(s_ij) and φ_step(s_ij) are two positional embeddings: the former encodes the position within the sentence while the latter encodes the step position within the protocol. Hence, host culture and host culture and are two valid spans that are enumerated through this process.
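The enumeration and span representation can be sketched as follows. This is a simplified stand-in: random vectors replace SciBERT outputs, mean pooling replaces the attention-based soft head, the positional and step embeddings are elided, and the concatenation is shown before the FFNN_e projection; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 12, 16             # tokens in a sentence, token embedding size (toy values)
MAX_WIDTH = 10            # spans are enumerated up to 10 tokens wide
tokens = rng.normal(size=(n, d))             # stand-in for SciBERT outputs t_1..t_n
width_emb = rng.normal(size=(MAX_WIDTH, 8))  # learnt width embedding phi_w

def enumerate_spans(n, max_width=MAX_WIDTH):
    # All (i, j) index pairs with span width j - i + 1 <= max_width.
    return [(i, j) for i in range(n) for j in range(i, min(i + max_width, n))]

def span_repr(i, j):
    # Concatenate first/last token, a mean-pooled stand-in for the soft
    # head phi_sh, and the width embedding; FFNN_e would map this to d_e.
    soft_head = tokens[i:j + 1].mean(axis=0)
    return np.concatenate([tokens[i], tokens[j], soft_head, width_emb[j - i]])

spans = enumerate_spans(n)
E = np.stack([span_repr(i, j) for i, j in spans])  # input to FFNN_e
```

For a 12-token sentence this yields 75 candidate spans, quadratic growth capped by the width limit, which is why the pruning step that follows is needed.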
Span Pruning: Next, low-scoring spans are filtered out during both the training and evaluation phases. Following Lee et al. (2017), the scoring function is implemented as a feed-forward network φ_s(e_ij) = w_s^T FFNN_s(e_ij). We rank and pick a number of top-scoring spans per sentence by using a combination of (i) a maximum fraction λ_p = 0.1 of spans per sentence, and (ii) a minimum score threshold λ_t = 0.5. Thus, the span host culture receives a significantly higher score than host culture and, indicating that the former is the correct reagent entity in the prescribed step. These span candidates are then passed to the transcoder block.
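A minimal sketch of the pruning rule, assuming scores have already been produced by the scoring FFNN; the example data and the helper name `prune_spans` are made up for illustration.

```python
def prune_spans(scored_spans, lambda_p=0.1, lambda_t=0.5):
    """Keep at most a lambda_p fraction of spans per sentence, and only
    those whose score clears the lambda_t threshold (Section 4.1)."""
    keep_n = max(1, int(lambda_p * len(scored_spans)))
    ranked = sorted(scored_spans, key=lambda s: s[1], reverse=True)
    return [(span, score) for span, score in ranked[:keep_n] if score >= lambda_t]

# "host culture" outscores the over-long "host culture and" (scores invented).
scored = [(("host", "culture"), 0.92), (("host", "culture", "and"), 0.31),
          (("1.0", "mL"), 0.81)] + [((f"junk{i}",), 0.05) for i in range(17)]
kept = prune_spans(scored)
```

With 20 candidates, λ_p = 0.1 caps the output at two spans, and both survivors also clear the λ_t = 0.5 threshold.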

Transcoder Block
In the transcoder block, we propose a novel architecture to improve relation and entity representations from latent structures. The objective is twofold: (i) to leverage localized features at the phrase and sentence levels to resolve long-range relations through relation convolutions, and (ii) to learn from latent structures how to resolve implicit arguments through a multi-head relational graph convolutional network (multi-head R-GCN).
Each transcoder block is composed of Relation Encoder (Section 4.2.1), Convolution (Section 4.2.2), and Decoder (Section 4.2.3) components that discover local relationships between the input entities. These relations (represented as latent structures A ∈ R^{m×m×r}) are then passed to the Multi-Head R-GCN (Section 4.2.4) component of the same transcoder block to enrich the entity representations with information about those discovered local relationships. These enriched entities can then be used to predict more complex cross-sentence relationships in the next transcoder block. To facilitate deeper networks, we make use of residual connections (He et al., 2016) followed by layer normalization (Ba et al., 2016), denoted by Add + Norm in Figure 3B.
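The Add + Norm wiring can be sketched as a generic residual connection followed by layer normalization (this is the standard construction, not the authors' exact implementation; learnable scale/shift parameters are omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean and unit variance (Ba et al., 2016).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_out):
    # "Add + Norm" in Figure 3B: residual connection, then layer norm.
    return layer_norm(x + sublayer_out)
```

Each sub-component's output is combined with its input this way, which is what allows the N = 8 blocks reported in Section 6 to train stably.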
We shall make use of the example (Figure 1), focusing on the long-range relationships between Step 1 (i.e., Grow the bacteria overnight.) and Step 5 (i.e., Add 1.0 mL host culture and either 1.0 or 0.1 mL viral concentrate.) to illustrate the flow of information through the transcoder block. The first transcoder block takes as input m high-scoring candidate entity span representations (as E^(0) ∈ R^{m×d_e}) as determined by the pruner³. For instance, from Step 1 we identify the following high-scoring candidate entities: grow, bacteria, and overnight; from Step 5 we find add, 1.0 mL, host culture, 0.1 mL, and viral concentrate.

Relation Encoder:
Following Nguyen and Verspoor (2019), we make use of a bi-affine pairwise function to encode relations for every pair of entity span representations. That is, we generate relational embeddings for entity pairs like grow and bacteria, grow and overnight, etc. Each entity span e_ij ∈ R^{d_e} is first projected using two FFNNs to generate the representations e^h_ij ∈ R^{d_h} and e^t_ij ∈ R^{d_t}, indicating the first (head) and second (tail) argument of a relation. In practice, we batch process all entities to generate E_h ∈ R^{m×d_h} and E_t ∈ R^{m×d_t}, where m is the number of candidate spans. In our experiments, we let d_h = d_t, then use a bi-affine operator to calculate a tensor R^(l) ∈ R^{m×d_r×m} of relational embeddings:

R^(l) = (E_h L) E_t^T

Here, L ∈ R^{d_h×d_r×d_t} is a learned parameter tensor and d_r is the relation embedding size.
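The bi-affine operator can be sketched with a single einsum. The tanh-activated linear layers below stand in for the two FFNN projections, and all sizes are toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, de, dh, dr = 6, 16, 12, 5   # spans, span dim, head/tail dim, relation dim
E = rng.normal(size=(m, de))   # candidate span representations

# Two projections (single linear layers here) giving head and tail views.
Wh, Wt = rng.normal(size=(de, dh)), rng.normal(size=(de, dh))  # d_h = d_t
Eh, Et = np.tanh(E @ Wh), np.tanh(E @ Wt)

# Bi-affine operator R = (E_h L) E_t^T with L in R^{d_h x d_r x d_t}:
# R[m, r, n] = sum_{h,t} Eh[m, h] * L[h, r, t] * Et[n, t]
L = rng.normal(size=(dh, dr, dh))
R = np.einsum("mh,hrt,nt->mrn", Eh, L, Et)   # R in R^{m x d_r x m}
```

Every (head, tail) span pair thus gets its own d_r-dimensional relational embedding in one batched operation, rather than m² separate pairwise passes.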

Relation Convolutions:
We enrich the relational embeddings R^(l) with local relational features within a single phrase (found near the diagonal) and across multiple phrases (found in the upper and lower triangles) using a stack of convolutional layers. We denote C_w(·) to be a 2D convolutional operator applying a kernel of size w × w. In our model, we make use of a two-layer convolution. The input R^(l) is reshaped as R^{m×m×d_r} such that the dimension d_r acts as the channel dimension in the convolutions. The intermediate output T^(0) is in R^{m×m×2d_r}, with the final output R^(l) ∈ R^{m×m×d_r}.

³ The entity span representations from the entire sub-protocol (i.e., from Steps 1 to 5) are passed as a bag of entities E^(0) ∈ R^{m×d_e}. However, there are not yet any relations (i.e., R^(0)) to be passed to the first transcoder block.
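A toy version of the two-layer relation convolution follows. The 3×3 kernel width is an assumption (the paper does not state w), and a hand-rolled 'same'-padded convolution replaces a deep-learning framework; only the channel widening d_r → 2d_r → d_r follows the text.

```python
import numpy as np

def conv2d(x, kernel):
    """'Same'-padded 2D convolution over an (m, m, c_in) relation map with
    a (w, w, c_in, c_out) kernel; a stand-in for C_w(.) in Section 4.2.2."""
    m, _, _ = x.shape
    w = kernel.shape[0]
    p = w // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.empty((m, m, kernel.shape[-1]))
    for i in range(m):
        for j in range(m):
            patch = xp[i:i + w, j:j + w, :]
            out[i, j] = np.einsum("abc,abcd->d", patch, kernel)
    return out

rng = np.random.default_rng(2)
m, dr = 6, 4
R = rng.normal(size=(m, m, dr))             # relational embeddings, d_r channels
K1 = rng.normal(size=(3, 3, dr, 2 * dr))    # first layer widens to 2*d_r ...
K2 = rng.normal(size=(3, 3, 2 * dr, dr))    # ... second layer maps back to d_r
T0 = np.maximum(conv2d(R, K1), 0.0)         # T^(0) in R^{m x m x 2 d_r}
R_out = conv2d(T0, K2)                      # enriched R^(l) in R^{m x m x d_r}
```

Because the kernel slides over the m × m pair grid, each pair's embedding mixes in features from nearby pairs: entries near the diagonal share within-phrase context, while off-diagonal entries share cross-phrase context.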

Relation Decoder:
The relational embeddings R^(l) are decoded using a 2-layer FFNN. The decoded scores A ∈ R^{m×m×r} capture the latent structures (as shown in Figure 3B). These are re-encoded using the multi-head R-GCN to strengthen the model's ability to predict more complex relations in the next transcoder layer.

Multi-head R-GCN:
For each predicted relation score A_r ∈ R^{m×m}, we add self loops and perform Laplacian smoothing (Kipf and Welling, 2017; Li et al., 2018) for normalization following: Â_r = D̃^{-1/2} Ã_r D̃^{-1/2}, where Ã_r = A_r + I and D̃_ii = Σ_j Ã_ijr. Then, using Â_r as an adjacency matrix, we learn multi-head, direction-specific graph convolution transformations. Each head corresponding to a given relation r performs graph convolutions on the entity representation E^(l−1) ∈ R^{m×d_e} to generate E^(l)_r ∈ R^{m×(d_r/r)}. A single R-GCN^(i)_r(·) operation (Schlichtkrull et al., 2018) for a given relation type r and the i-th GCN layer corresponds to:

E^(i) = σ(Â_r E^(i−1) W^(i)_in,r + Â_r^T E^(i−1) W^(i)_out,r + b^(i)_r)

where W^(i)_in,r, W^(i)_out,r ∈ R^{d_{i−1}×d_i} are learnable parameters for incoming and outgoing edge directions respectively and b^(i)_r is the bias. We use the ReLU activation function σ in our networks. As shown in Figure 3B, the outputs of the individual R-GCN heads are concatenated and passed through an FFNN layer to compute the final output E^(l).
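The normalization and a single direction-specific head can be sketched as follows. This is a simplified stand-in for the multi-head R-GCN (one head, dense matrices, no dropedge), with function names invented for illustration.

```python
import numpy as np

def normalize_adj(A):
    """Laplacian smoothing of Section 4.2.4:
    A_hat = D~^{-1/2} (A + I) D~^{-1/2}, with D~ the degree of A + I."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def rgcn_head(A_hat, E, W_in, W_out, b):
    # Direction-specific graph convolution for one relation head:
    # incoming edges via A_hat, outgoing via A_hat^T, then ReLU.
    return np.maximum(A_hat @ E @ W_in + A_hat.T @ E @ W_out + b, 0.0)
```

In the full model, one such head runs per relation type r on the soft adjacency A_r, and the per-head outputs are concatenated and projected to form the enriched E^(l).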
For instance, suppose we discovered a local relation in Step 1 between grow and bacteria after the Relation Decoder component in the first transcoder block. The Multi-head R-GCN takes in the discovered relation (through the latent structure A) and enriches grow's entity embeddings, enabling the next transcoder layer to predict a more complex cross sentence relation between grow (Step 1) and host culture (Step 5). Since bacteria and host culture are semantically related, they have similar entity embeddings, and therefore the enriched representation of grow (now containing information about bacteria) allows for establishing the relation between grow and host culture in the next transcoder block.

Training and Regularization
The loss function is a linear combination of cross entropy losses for each of the tasks. We additionally apply label smoothing (Szegedy et al., 2016). The relation extraction is trained on gold entity spans. For regularization, we apply dropout (Srivastava et al., 2014) to the output of each FFNN layer. We make use of dropedge (Rong et al., 2019) for the adjacency matrix A r before it is passed to the multi-head R-GCN model.
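The training objective can be sketched as follows. The uniform-distribution form of label smoothing and the equal task weights are assumptions, as the paper does not spell them out.

```python
import numpy as np

def smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross entropy with label smoothing (eps = 0.1, as in the appendix).
    The true class gets probability 1 - eps + eps/k; the rest share eps."""
    k = logits.shape[-1]
    log_p = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    soft = np.full(k, eps / k)
    soft[target] += 1.0 - eps
    return -(soft * log_p).sum()

def total_loss(span_l, ent_l, rel_l, weights=(1.0, 1.0, 1.0)):
    # Linear combination of the per-task cross-entropy losses
    # (weights here are an assumed equal split).
    return sum(w * l for w, l in zip(weights, (span_l, ent_l, rel_l)))
```

Dropout on FFNN outputs and dropedge on A_r (randomly zeroing adjacency entries before the R-GCN) would be applied on top of this during training.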

Experiments
In contrast to general language models, domain-specific methods have resulted in more competitive baselines and are better suited (Tabassum et al., 2020; Wadden et al., 2019; Eberts and Ulges, 2019) for simultaneously resolving and predicting entities and relations over longer contexts. Thus, we evaluate our model against two state-of-the-art models for jointly predicting entities and relations in the scientific-text domain, namely DyGIE++ (Wadden et al., 2019) and spERT (Eberts and Ulges, 2019), on the WLP-MSTG corpus.
We conduct five (5) runs with random initializations for each evaluation and report the test set performance of the model that achieved the median relation F1 score on the dev(elopment) set. All models are evaluated end-to-end, where the model takes as input tokenized sentences and predicts all the entities and relations, generating an MSTG. We use the standard precision, recall, and F1 metrics. An entity is considered correct if its predicted span and label match the ground truth. Relation extraction is performed on the predicted entity spans. A relation is correct if its relation type and both entity arguments are correct (in span and type) against the ground truth. We also evaluate our model's performance on the WNUT 2020 (Tabassum et al., 2020) corpus. To fairly evaluate relation extraction, we use gold entities to make relation predictions by modifying the loss function to only train on relation scores. We additionally concatenate entity label embeddings to the span representation in Equation (1).
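The relation-level scoring criterion can be sketched as an exact-match set comparison; the tuple encoding of (head entity, relation type, tail entity) below, and the example data, are illustrative.

```python
def relation_f1(pred, gold):
    """A relation counts as correct only if its type and both argument
    entities (span text and entity type) match the ground truth."""
    tp = len(set(pred) & set(gold))
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy predictions against Figure 1-style gold relations (data invented).
gold = {(("Grow", "Action"), "Product", ("host culture", "Reagent")),
        (("Remove", "Action"), "Site", ("tube", "Location"))}
pred = {(("Grow", "Action"), "Product", ("host culture", "Reagent")),
        (("Add", "Action"), "Site", ("plate", "Location"))}
```

Here one of two predictions matches, giving precision, recall, and F1 of 0.5 each; in the end-to-end setting, a wrong entity span or type makes every relation touching it count as wrong.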

Results
On the WLP-MSTG corpus, Table 3 shows our best model with N = 8 transcoder block layers making a modest improvement on entity extraction at 82.0% but improving significantly upon the previous state-of-the-art methods (i.e., DyGIE++ and spERT) in predicting relations. Our model outperforms the baselines for relation extraction, with an F1 score of 68.0% on inter-Action Phrase (iAP) relations and 54.5% on cross-Action Phrase Temporal and Causal (cAP-TaC) relations. We further enhanced the performance of our model by sharing the relation decoder's parameters across all layers of the transcoder block (Section 4.2.3). This grounds the latent structures in output relation types, which also makes them interpretable. The shared relation decoder marginally outperforms the not-shared configuration, by 0.5% for iAP relations and 1.1% for cAP-TaC relations.
Short and Long Range Relations: On the WNUT 2020 corpus, which only includes intra-sentence relations, Table 4 shows that our model outperforms the best single model that used the original data by 1.0%. We also note that our model is competitive against the ensemble approach that included models trained on an altered version of the original corpus in which duplicate text was removed after clustering. On the WLP-MSTG corpus, we can evaluate both short- and long-range relations: from Table 3, we see a 3.5% improvement in F1 score over DyGIE++ for iAP relations. This shows that our model leverages the cross-sentence temporal and causal relations that were additionally annotated in WLP-MSTG to improve local iAP relations. Our model outperforms DyGIE++ and spERT on intra-sentence relations by 4.3% and 26.1% respectively, and significantly improves on inter-sentence cAP-TaC relations by 45.5% and 21.5% respectively. This is attributed to the positional embeddings along with the relational convolutions, which enable the model to learn intra- and inter-action-phrase relations effectively. We see spERT performing better for "Overlaps", which is largely attributed to the 'CLS' token that spERT embeds to make relation predictions. Figure 4 shows performance when varying the number of sentences between entities involved in a relation. We observe our model performing the best for all distances between sentences. This is once again attributed to the relational convolution component, which is effective in capturing far-away relations.
Temporal and Implicit Arguments: In Table 6, we show our model outperforming the baselines for temporal relations with a 53.4% F1 score. We also observe significant improvements across the board for resolving implicit arguments. We see the highest gains (at 55.6%) compared to the baseline models (1.6% for DyGIE++ and 10.2% for spERT) for the (E-I) case (Figure 2a), which only contains 169 samples in the test set. Our model is able to correctly resolve the implicit source (input) to an action by utilizing simple relations that are typically connected to explicit arguments.
Causal relations: The performance of our model for causal relations is comparable to DyGIE++, as seen in Table 6. Causal relations are relatively easier for the baseline models to capture, as they tend to have specific prepositions in between. Our model shows a performance gain compared to DyGIE++ and about a 10.9% improvement against spERT. This is primarily attributed to the multi-head R-GCN, which builds upon simple relations that provide clues to establish harder causal relations. Cross-sentential 'Enables' relations (as seen in Table 5) are challenging even for our model as, once again, we do not encode any contextual features.
Model Ablation: The ablation study shows that the components (positional embeddings, relation convolutions, and the multi-head R-GCN) play a significant role in improving cAP-TaC performance. Relation convolutions contribute the most to iAP and cAP-TaC relations, by about 1.2% and 2.4% respectively. Positional embeddings impact iAP relations more (by 1.1%), whereas the multi-head R-GCN only impacts the more complex relations (cAP-TaC by 1.1%) and does not help in improving simpler relations.
How Many Layers?: Figure 5 shows that more layers generally improve far-away relations without improving closer ones. This shows that although our model can build upon simple relations that are typically close by, it cannot do the opposite, i.e., leverage far-away relations (which are typically more complicated) to improve more challenging closer relations. Our model discovers those complex, distant relations too deep into the network for them to be utilized in predicting the challenging local relations.

Conclusions and Future Work
We present the WLP-MSTG corpus, an extension of the WLP corpus that includes cAP-TaC relationships for building MSTGs. This corpus highlights two unique challenges: (i) the implicit argument problem and (ii) long-range relations. To address these issues, our model builds upon latent structures thus outperforming previous state-of-the-art models for predicting iAP and cAP-TaC relations.
We also report significant improvements in understanding implicit arguments and identifying long-range relationships across multiple sentences. However, our model's lower absolute performance indicates that we have not fully captured the information needed to facilitate modeling end-to-end workflows, which would have a lasting impact in improving automation in the life sciences and other domains.

References
James F. Allen. 1984. Towards a general theory of action and time. Artificial Intelligence, 23(2).

Negations are annotated by simply negating the "Action" involved in the relationship. For example, in Mix reagents carefully to not spill contents, we replace a "Prevents" relation from Mix to spill with an "Enables" relation from Mix to not spill. For more elaborate negations, we make use of "Mod-Link" to connect the additional descriptors to the relevant action.

Implementation Details
In evaluating on WLP-MSTG, to overcome memory limitations in baseline models during training and inference, we sub-divide long protocols into overlapping windows of 5 sentences each, with a stride of 2 (i.e., each consecutive window shares 3 sentences). To ensure a fair comparison, we also apply this restriction to our model, although our model is capable of a much larger window size. The final evaluation is done by merging the predictions, in the form of sub-graphs, into one complete material state transfer graph (MSTG) and resolving duplicate predictions through majority voting. We identify duplicates through exact match of span boundaries for entities and exact match of entity spans and their types for relations.
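The windowing scheme can be sketched as follows; `split_windows` is a hypothetical helper name, not from the released code.

```python
def split_windows(sentences, size=5, stride=2):
    """Overlapping windows of 5 sentences with stride 2, so consecutive
    windows share 3 sentences (Implementation Details)."""
    windows = []
    i = 0
    while i < len(sentences):
        windows.append(sentences[i:i + size])
        if i + size >= len(sentences):
            break
        i += stride
    return windows

# A 9-sentence protocol yields three windows sharing 3 sentences each.
w = split_windows(list(range(9)))
```

Predictions made per window would then be merged into one MSTG, with duplicates across overlapping windows resolved by majority voting as described above.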
Hyperparameters: We make use of the Adam optimizer with an initial learning rate of 2.13 × 10⁻⁵. For generating span candidates, we only enumerate spans up to 10 tokens in width. We set the positional embedding φ_pos(s_ij) table size to 100. For the step embedding φ_step(s_ij), we only learn embeddings for 5 steps. Both embeddings use an embedding dimension of 50. The span embedding size is d_e = 340, and the relational embedding size d_r is set to 100. The label smoothing parameter is set to the default value of 0.1. The dropout used in every FFNN has p = 0.2, and the dropedge used right before the multi-head R-GCN model is set with p = 0.5.