A Graph-Based Neural Model for End-to-End Frame Semantic Parsing

Frame semantic parsing is a semantic analysis task based on FrameNet that has received great attention recently. The task usually involves three sequential subtasks: (1) target identification, (2) frame classification and (3) semantic role labeling. The three subtasks are closely related, yet previous studies model them individually, which ignores their internal connections and induces error propagation. In this work, we propose an end-to-end neural model to tackle the task jointly. Concretely, we exploit a graph-based method, regarding frame semantic parsing as a graph construction problem. All predicates and roles are treated as graph nodes, and their relations are taken as graph edges. Experimental results on two benchmark datasets of frame semantic parsing show that our method is highly competitive, resulting in better performance than pipeline models.


Introduction
Frame semantic parsing (Gildea and Jurafsky, 2002) aims to analyze all sentential predicates as well as their FrameNet roles as a whole, and has received great interest recently. This task can be helpful for a number of tasks, including information extraction (Surdeanu et al., 2003), question answering (Shen and Lapata, 2007), machine translation (Liu and Gildea, 2010) and others (Coyne et al., 2012; Chen et al., 2013; Agarwal et al., 2014). Figure 1 shows an example, where all predicates as well as their semantic frames and roles in the sentence are depicted.
Previous studies (Swayamdipta et al., 2017; Bastianelli et al., 2020) usually divide the task into three subtasks: target identification, frame classification and semantic role labeling (SRL). By performing the three subtasks sequentially, the whole frame semantic parsing can be accomplished. The majority of works focus on either one or two of the three subtasks, treating them separately (Yang and Mitchell, 2017; Botschen et al., 2018; Peng et al., 2018).
Figure 1: An example involving frame semantic structures, taken from FrameNet (Baker et al., 1998). Frame-evoking predicates are highlighted in the sentence, and corresponding frames are shown in colored blocks below. The frame-specific roles are underlined with their frames in the same row.
The above formalization has two weaknesses. First, modeling the three subtasks individually cannot make effective use of the relationships among them; in particular, the earlier subtasks cannot exploit information from the later ones. Second, the pipeline strategy suffers from error propagation, where errors made in earlier subtasks influence the later subtasks as well. To address the two weaknesses, end-to-end modeling is one promising alternative, which has been widely adopted in natural language processing (NLP) (Cai et al., 2018; Sun et al., 2019; Fu et al., 2019; Fei et al., 2020).
In this work, we propose a novel graph-based model to tackle frame semantic parsing in an end-to-end way, using a single model to perform the three subtasks jointly. We organize all predicates and their FrameNet semantics as a graph, and then design an end-to-end neural model to construct the graph incrementally. An encoder-decoder model is presented to build the graph, where the encoder is equipped with contextualized BERT representations (Devlin et al., 2019), and the decoder performs node generation and edge building sequentially. Our final model is elegant and easy to understand as a whole.
We conduct experiments on two benchmark datasets to evaluate the effectiveness of our proposed model. First, we study our graph-based framework in two settings, the end-to-end scenario and the pipeline manner, where node building and edge building are trained separately. Results show that end-to-end modeling is much better. Besides, we also compare our model with several other pipelines, where similar findings can be observed. Second, we compare our graph-based framework with previous methods on the three subtasks individually, finding that the graph-based architecture is highly competitive. We obtain the best performance in the literature, leading to a new state-of-the-art result. Further, we conduct extensive analyses to understand our method in depth.
In summary, we make the following two major contributions in this work:
(1) We propose a novel graph-based model for frame semantic parsing which achieves competitive results for the end-to-end task as well as the individual subtasks.
(2) To the best of our knowledge, we present the first work on end-to-end frame semantic parsing that solves all included subtasks together in a single model.
We will release our code and experimental settings publicly at https://github.com/Ch4osMy7h/FramenetParser to help reproduce our results and facilitate future research.

Related Work
Frame-Semantic Parsing Frame-semantic parsing has received great interest since it was released as an evaluation task of SemEval 2007 (Baker et al., 2007). The task attempts to predict semantic frame structures defined in FrameNet (Baker et al., 1998), which are composed of frame-evoking predicates, their corresponding frames and semantic roles. Most of the previous works (Swayamdipta et al., 2017; Bastianelli et al., 2020) focus on a pipeline framework to solve the task, training target identification, frame classification and semantic role labeling models separately. In this work, to the best of our knowledge, we present the first end-to-end model to handle the task jointly.
Among the three subtasks of frame semantic parsing, semantic role labeling has been researched most extensively (Kshirsagar et al., 2015; Yang and Mitchell, 2017; Peng et al., 2018; Marcheggiani and Titov, 2020). It is also highly related to PropBank-style semantic role labeling (Palmer et al., 2005), differing only in the frame definition. Thus, models for the two types of semantic role labeling can be borrowed from each other. There are several end-to-end PropBank-style semantic role labeling models as well (Cai et al., 2018; Fu et al., 2019). However, these models are difficult to apply directly to frame semantic parsing due to the additional frame classification as well as the discontinuous predicates. In this work, we present a completely different graph construction style model to solve end-to-end frame semantic parsing elegantly.

Graph-Based Methods
Recently, graph-based methods have been widely used in a range of other tasks, such as dependency parsing (Dozat and Manning, 2016; Kiperwasser and Goldberg, 2016; Ji et al., 2019), AMR parsing (Flanigan et al., 2014; Lyu and Titov, 2018; Zhang et al., 2019a,b) and relation extraction (Sun et al., 2019; Fu et al., 2019; Dixit and Al-Onaizan, 2019). In this work, we aim at frame semantic parsing, organizing the three included subtasks with a well-designed graph and thus converting the task into a graph-based parsing task naturally.

Task Formulation
The goal of frame-semantic parsing is to extract semantic predicate-argument structures from texts, where each predicate-argument structure includes a predicate given by a span of words, a well-defined semantic frame to express the key roles of the predicate, and the values of these roles given by word spans. Formally, given a sentence $X$ with $n$ words $w_1, w_2, \ldots, w_n$, frame-semantic parsing aims to output a set of tuples $Y = \{y_1, y_2, \ldots, y_K\}$, where each $y_i$ consists of a predicate $p_i$, its evoked frame $f_i$, and a set of role-value pairs $(r_{i,*}, v_{i,*})$, where $r_{i,*}$ are frame roles derived from $f_i$ and $v_{i,*}$ are also word spans in $X$. The full frame semantic parsing is usually divided into the following three subtasks:
• Target Identification (also known as predicate identification), which is to identify all valid frame-evoking predicates from $X$, outputting $P = \{p_1, \ldots, p_K\}$.
• Frame Classification, which is to predict the concrete frame $f_i$ evoked by a certain predicate $p_i \in P$.
• Semantic Role Labeling, which is to assign concrete values to the roles $r_{i,*}$ given a predicate-frame pair $(p_i, f_i)$.
Previously, the majority of work on frame semantic parsing performs the three subtasks individually, ignoring their close connections and being vulnerable to error propagation. Thus, we present an end-to-end graph-based model to accomplish the three subtasks with a single model.
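As an illustration of the output structure described above, the following minimal Python sketch stores each tuple $y_i$ as a small record and derives the three subtask views from it. The class name `FrameStructure`, the helper `subtask_views`, the inclusive (start, end) span convention and the toy sentence are assumptions made for this example only; they are not taken from the released code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical convention: a span is an inclusive (start, end) pair of word indices.
Span = Tuple[int, int]


@dataclass
class FrameStructure:
    predicate: List[Span]  # p_i: one span, or several for a discontinuous predicate
    frame: str             # f_i: the evoked frame
    roles: List[Tuple[str, Span]] = field(default_factory=list)  # (r_i*, v_i*) pairs


def subtask_views(structures: List[FrameStructure]):
    """Split full structures into the outputs of the three subtasks."""
    targets = [s.predicate for s in structures]
    frames = [(s.predicate, s.frame) for s in structures]
    arguments = [(s.predicate, s.frame, role, span)
                 for s in structures for role, span in s.roles]
    return targets, frames, arguments


# Toy sentence (illustrative only): "She bought a book"
example = FrameStructure(predicate=[(1, 1)], frame="Commerce_buy",
                         roles=[("Buyer", (0, 0)), ("Goods", (2, 3))])
targets, frames, arguments = subtask_views([example])
```

Keeping the predicate as a list of spans lets the same record cover both continuous and discontinuous predicates.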

The Graph-Based Methodology
We formalize the frame-semantic parsing task as a graph constructing problem, and further present an encoder-decoder model to perform the task in an end-to-end way. The encoder aims for representation learning of the frame semantic parsing, and the decoder constructs the semantic graph incrementally. Concretely, for the encoder, we compute the span representations since the basic processing units of our model are word spans, and for the decoder, we first generate all graph nodes, and then build edges among the graph nodes. Figure 2 shows the overall architecture of our method.

Encoding
Due to the strong capability of BERT (Devlin et al., 2019) for representation learning, we adopt it as the backbone of our model. Given a sentence $X = \{w_1, w_2, \ldots, w_n\}$, BERT converts each word $w_i$ into word pieces and feeds them into deep transformer encoders to get piece-level representations. To obtain a word-level representation, we average all piece vectors of word $w_i$ as its final representation $e_i$.
For further feature abstraction, we exploit BiHLSTM (Srivastava et al., 2015) to compose high-level features based on the word-level outputs $e_1, \ldots, e_n$:
$$h_1, \ldots, h_n = \mathrm{BiHLSTM}(e_1, \ldots, e_n),$$
where the gated highway connections are applied to BiLSTMs.
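The word-piece averaging step can be sketched in plain Python as follows. The function name `average_pieces` and the `word_ids` alignment list are hypothetical names introduced here, and the sketch assumes that every word has at least one piece.

```python
def average_pieces(piece_vectors, word_ids):
    """Average piece-level vectors into word-level vectors.

    piece_vectors: list of equal-length float lists, one per word piece.
    word_ids: word_ids[k] is the index of the word that piece k belongs to.
    """
    n_words = max(word_ids) + 1
    dim = len(piece_vectors[0])
    sums = [[0.0] * dim for _ in range(n_words)]
    counts = [0] * n_words
    for vec, w in zip(piece_vectors, word_ids):
        counts[w] += 1
        for d in range(dim):
            sums[w][d] += vec[d]
    # Assumption: every word index in [0, n_words) owns at least one piece.
    return [[x / counts[w] for x in sums[w]] for w in range(n_words)]


# Two pieces for word 0 and one piece for word 1.
word_vectors = average_pieces([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [0, 0, 1])
```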

Span Representation
We enumerate all possible spans $S = \{s_1, s_2, \ldots, s_m\}$ in a sentence and limit the maximum span length to $L$. Then, each span $s_i \in S$ is represented by:
$$g_i = [h_{\mathrm{START}(i)}; h_{\mathrm{END}(i)}; h_{\mathrm{ATTN}}; \phi(s_i)],$$
where $\phi(s_i)$ represents the learned embeddings of span width features, $h_{\mathrm{ATTN}}$ is computed by a self-attention mechanism which weights the vector representations of the words in the span by normalized attention scores, and $\mathrm{START}(i)$ and $\mathrm{END}(i)$ denote the start and end indices of $s_i$.
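The span enumeration with a maximum length can be sketched as below. The function name `enumerate_spans` and the inclusive (start, end) indexing are assumptions for this illustration.

```python
def enumerate_spans(n_words, max_width):
    """List all inclusive (start, end) spans of at most max_width words."""
    return [(i, j)
            for i in range(n_words)
            for j in range(i, min(i + max_width, n_words))]


spans = enumerate_spans(4, 2)
```

The number of candidates grows roughly as n × L rather than n², which is why bounding the span length keeps the later node classification tractable.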

Node Building
Node Generation We exploit a preliminary classification to achieve the goal of node generation. First, a span can be either a graph node or not. Further, a graph node can be a full or partial predicate node, and the node can also be a role node. In total, we define four types for a given span:
• FPRD: a full predicate span.
• PPRD: a partial predicate span.
• ROLE: a role span.
• NULL: a span that is not a graph node.
The type of a span can be any combination of the elements in the set {FPRD, PPRD, ROLE}, or NULL. Thus, each span can be classified into eight types (i.e., FPRD, PPRD, ROLE, their combinations, and NULL). Given an input span $s_i$ with its vectorial representation $g_i$, we exploit one MLP layer with softmax to classify the span type:
$$p_n = \mathrm{softmax}(\mathrm{MLP}_n(g_i)),$$
where $p_n$ indicates the probabilities of span types. By this classification, all non-null type spans are graph nodes, reaching the goal of node generation.
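The eight span types and the node generation rule can be illustrated with the sketch below. The "+"-joined label names, the function names and the hand-written scores (which stand in for the MLP outputs) are assumptions of this example, not the labels used in the actual implementation.

```python
from itertools import combinations
import math

BASE_TYPES = ["FPRD", "PPRD", "ROLE"]


def span_type_labels():
    """NULL plus every non-empty combination of the three base types."""
    labels = ["NULL"]
    for k in range(1, len(BASE_TYPES) + 1):
        for combo in combinations(BASE_TYPES, k):
            labels.append("+".join(combo))  # hypothetical naming scheme
    return labels


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def generate_nodes(span_scores, labels):
    """Keep every span whose most probable type is not NULL."""
    nodes = {}
    for span, scores in span_scores.items():
        probs = softmax(scores)
        best = max(range(len(labels)), key=probs.__getitem__)
        if labels[best] != "NULL":
            nodes[span] = labels[best]
    return nodes


labels = span_type_labels()
role_scores = [0.0] * len(labels)
role_scores[labels.index("ROLE")] = 3.0   # stand-in for MLP scores
null_scores = [0.0] * len(labels)
null_scores[labels.index("NULL")] = 3.0
nodes = generate_nodes({(0, 0): role_scores, (1, 2): null_scores}, labels)
```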

Frame Classification of Predicate Nodes
Node generation detects all graph nodes roughly, assigning each node a single label to indicate whether it can serve as a predicate or a role. Here we go further to recognize the semantic frames for all predicate nodes, which can be regarded as an in-depth analysis of node attributes. This step corresponds to the frame classification subtask. Given an input span $s_i$ of a predicate node (FPRD or PPRD) with representation $g_i$, we use another MLP layer together with softmax to output the probabilities of each candidate frame for the predicate node:
$$p_c = \mathrm{softmax}(\mathrm{MLP}_c(g_i)),$$
where $p_c$ denotes the output probabilities of semantic frames. In particular, frames are constrained by the lexical units defined in FrameNet. For example, the predicate with the lexical unit "meeting" evokes only the frames Social_event and Discussion.
We also adopt the pseudo strategy following Swayamdipta et al. (2017) to optimize the classification. First, we use the spaCy lemmatizer (Honnibal et al., 2020) to convert an input sentence into lemmas. Then, if a word span is a predicate node, we treat the corresponding lemma span as the pseudo lexical unit and use it to index the corresponding set of semantic frames. Finally, we reduce the search space by masking frames outside the set. In our experiments, we find it practical to apply this strategy.
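The masking step can be sketched as a softmax restricted to the frames licensed by the pseudo lexical unit. The function name `masked_frame_probs`, the tiny lexicon, the frame inventory and the scores are assumptions of this example; the fallback to the full inventory for an unknown lexical unit is also an assumption rather than a documented behavior of the model.

```python
import math


def masked_frame_probs(scores, frames, allowed):
    """Softmax over the allowed frames only; masked frames get probability 0."""
    keep = [i for i, f in enumerate(frames) if f in allowed]
    if not keep:
        # Assumption: fall back to all frames when the lexicon has no entry.
        keep = list(range(len(frames)))
    top = max(scores[i] for i in keep)
    exps = {i: math.exp(scores[i] - top) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(frames))]


# Toy lexicon built from the "meeting" example in the text.
LEXICON = {"meeting": {"Social_event", "Discussion"}}
FRAMES = ["Social_event", "Discussion", "Commerce_buy"]
probs = masked_frame_probs([1.0, 2.0, 5.0], FRAMES, LEXICON.get("meeting", set()))
best_frame = FRAMES[max(range(len(FRAMES)), key=probs.__getitem__)]
```

Although Commerce_buy has the highest raw score here, it is masked out because the lexicon does not license it for "meeting".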

Edge Building
After graph nodes are ready, we then build edges to accomplish frame semantic parsing accordingly. There are two types of edges in our model.

Predicate-Predicate Edge
For extracting discontinuous mentions, we build edges between nodes that are predicate fragments (i.e., PPRD nodes). In detail, we treat it as a binary classification problem that decides whether the two nodes of an edge can form parts of the same predicate or not. Formally, given two PPRD nodes with the corresponding spans $s^p_i$ and $s^p_j$ and their encoding representations $g^p_i$ and $g^p_j$, we utilize one MLP layer to classify their edge type, yielding $p_{pe}$, the probabilities of the two types, namely Connected and NULL (i.e., cannot be connected). The feature representation is borrowed from Zhao et al. (2020).

Predicate-Role Edge
For extracting frame-specific roles, we build edges between predicate nodes (i.e., nodes of type FPRD or PPRD) and role nodes (i.e., nodes of type ROLE). Given a predicate node $s^p_i$ and a role node $s^r_j$ with neural representations $g^p_i$ and $g^r_j$, respectively, we utilize another MLP layer to determine their edge type by multi-class classification, yielding $p_{re}$, the probabilities of predicate-role edge types (i.e., frame roles as well as a NULL label indicating no relation).
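The candidate pairs scored by the two edge classifiers can be enumerated as in the sketch below. Representing a node type as a set of base types, the function name `candidate_edges`, and skipping the pairing of a span with itself are assumptions made for this illustration.

```python
from itertools import combinations


def candidate_edges(nodes):
    """nodes maps a span to the set of base types predicted for it."""
    pprd = sorted(s for s, t in nodes.items() if "PPRD" in t)
    predicates = sorted(s for s, t in nodes.items() if t & {"FPRD", "PPRD"})
    roles = sorted(s for s, t in nodes.items() if "ROLE" in t)
    # Predicate-predicate candidates: unordered pairs of predicate fragments.
    pp_pairs = list(combinations(pprd, 2))
    # Predicate-role candidates; assumption: a span is not paired with itself.
    pr_pairs = [(p, r) for p in predicates for r in roles if p != r]
    return pp_pairs, pr_pairs


toy_nodes = {(0, 0): {"ROLE"}, (1, 1): {"FPRD"}, (3, 3): {"PPRD"}, (5, 5): {"PPRD"}}
pp_pairs, pr_pairs = candidate_edges(toy_nodes)
```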

Joint Training
To train the joint model, we employ the negative log-likelihood loss function for both the node building and edge building steps, where $y_n$ and $y_c$ are the gold labels for the text spans and predicate nodes, and $y_{pe}$ and $y_{re}$ are the gold edge labels for the predicate-predicate and predicate-role node pairs. The losses from the two steps are summed together, leading to the final training objective of our model.
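One plausible reading of this objective is sketched below: a negative log-likelihood term for each of the four classifiers, grouped into a node loss and an edge loss and then summed. The grouping of the frame term with the node step, the unweighted sum, and all function and argument names are assumptions of this sketch.

```python
import math


def nll(prob_rows, gold):
    """Sum of -log p(gold label) over the instances of one classifier."""
    return -sum(math.log(p[y]) for p, y in zip(prob_rows, gold))


def joint_loss(p_n, y_n, p_c, y_c, p_pe, y_pe, p_re, y_re):
    node_loss = nll(p_n, y_n) + nll(p_c, y_c)      # span types and frames
    edge_loss = nll(p_pe, y_pe) + nll(p_re, y_re)  # both edge types
    return node_loss + edge_loss


# One instance per classifier, each assigning probability 0.5 to the gold label.
half = [[0.5, 0.5]]
loss = joint_loss(half, [0], half, [1], half, [0], half, [1])
```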

Decoding
The decoding aims to derive frame semantic parsing results by the graph-based model. Here we describe the concrete process by the three subtasks.
Target Identification Target identification involves both the node building and edge building steps. First, all predicate nodes with type FPRD are predicates. Second, there is a small percentage of predicates composed of multiple nodes with type PPRD. If two or more such nodes are connected with predicate-predicate edges, we regard these nodes as one single valid predicate.
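This decoding rule can be sketched with a small union-find over the PPRD nodes. The function name `identify_targets` is hypothetical, and discarding PPRD nodes that end up unconnected is an assumption of this sketch, since the text does not say how they are handled.

```python
def identify_targets(fprd_spans, pprd_spans, connected_pairs):
    """Return predicates as lists of spans (one span, or several if discontinuous)."""
    parent = {s: s for s in pprd_spans}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    # Merge fragments joined by a predicted predicate-predicate edge.
    for a, b in connected_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for s in pprd_spans:
        groups.setdefault(find(s), []).append(s)

    targets = [[s] for s in fprd_spans]
    # Assumption: a PPRD node with no connected partner is dropped.
    targets += [sorted(g) for g in groups.values() if len(g) >= 2]
    return targets


predicates = identify_targets(
    fprd_spans=[(1, 1)],
    pprd_spans=[(3, 3), (5, 5), (7, 7)],
    connected_pairs=[((3, 3), (5, 5))],
)
```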

Frame Classification
The frame classification decoding is straightforward for single-node predicates. For multi-node predicates, there may exist conflicts among the frame classifications of different nodes. Concretely, given a multi-node predicate composed of two or more nodes, the max-scored frames evoked by them might be different. To address this issue, we first sum up the softmax distributions over all covered nodes and then select the max-scored frame.
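The conflict resolution for multi-node predicates can be sketched as follows. The function name `decode_frame`, the frame inventory and the probability values are made up for this illustration.

```python
def decode_frame(node_distributions, frames):
    """Sum per-node softmax distributions and return the max-scored frame."""
    totals = [sum(column) for column in zip(*node_distributions)]
    best = max(range(len(frames)), key=totals.__getitem__)
    return frames[best]


frames = ["Social_event", "Discussion", "Commerce_buy"]
# The two fragments disagree: the first prefers Social_event, the second Discussion.
frame = decode_frame([[0.6, 0.4, 0.0], [0.1, 0.9, 0.0]], frames)
```

A single-node predicate is the special case with one distribution, where the rule reduces to a plain argmax.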
Semantic Role Labeling The condition of semantic role labeling is similar to frame classification. For single-node predicates, the semantic role labeling output is deterministic. For multi-node predicates, we assign role values for the candidate roles inside the predicted frame only, and further select the concrete role node which has the highest probability with respect to the covered predicate nodes.
[...] previous work and FN1.7 is the latest version used recently which involves more semantics. We follow previous studies (Swayamdipta et al., 2017) to divide the two datasets into training, validation and test sets, respectively. Table 1 shows the overall data statistics.
Evaluation We measure the performance of frame semantic parsing by its three subtasks, respectively. For target identification, we treat a predicate as correct only when all its included word spans exactly match with the gold-standard spans of the predicate. For frame classification, we use the joint performance for evaluation, regarding a classification as correct only when the predicate, as well as the frame, are both correct. For semantic role labeling, we also use the joint performance regarding the role as correct when the predicate, role span (exact match), and role type are all correct, which is treated as our major metric.
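The exact-match criteria above amount to comparing sets of tuples, as in the sketch below. The function name `prf` and the tuple layout (predicate spans, role type, role span) are assumptions of this example and not the official evaluation script.

```python
def prf(predicted, gold):
    """Precision, recall and F1 under exact matching of hashable items."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Role items: (predicate spans, role type, role span); all three must match.
gold_roles = [(((1, 1),), "Buyer", (0, 0)), (((1, 1),), "Goods", (2, 3))]
pred_roles = [(((1, 1),), "Buyer", (0, 0)), (((1, 1),), "Goods", (3, 3))]
scores = prf(pred_roles, gold_roles)
```

The second predicted role has the right predicate and role type but a wrong span, so it counts as an error under the joint metric.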
Derived Models Following previous studies and our graph-based method, we can derive a range of basic models for comparisons:
• Node, the node building submodel which is the first step of our decoder module mentioned in Section 3.
• [...] (Sarawagi and Cohen, 2005) model for semantic role labeling which is borrowed from [...], where the only difference is that we use BERT as the representation layer for fair comparisons. 3
Note that the above derived models are trained individually. Based on these models, we can build five pipeline systems: (1) Predicate + Frame + Role, (2) Predicate•Frame + Role, (3) Predicate + Frame•Role, (4) Predicate•Frame + Semi-CRF, and (5) Node + Edge, which are exploited for comparisons with our graph-based end-to-end model.
Hyperparameters All our code is based on the AllenNLP library (Gardner et al., 2017) and trained on a single RTX-2080ti GPU. We choose BERT-base-cased 4, which consists of 12-layer transformers with a hidden size of 768 for all layers. We set all the hidden sizes of BiHLSTM to 200, and the number of layers to 6. The MLP layers have a dimension of 150 and a depth of 1, with the ReLU activation function. We apply dropout of 0.4 to BiHLSTM and 0.2 to MLP layers. Following Swayamdipta et al. (2018), we also limit the maximum length of spans to 15 for efficiency, resulting in an oracle recall of 95% on the development set. For training, we exploit online batch learning with a batch size of 8 to update the model parameters, and use the BertAdamW algorithm with a learning rate of $1 \times 10^{-5}$ to fine-tune BERT and $1 \times 10^{-3}$ to train the other parts of our model. Gradient clipping with a maximum value of 5.0 is exploited to avoid gradient explosion. The training process is stopped early if the performance does not increase within 20 epochs.
3 According to the preliminary experiments, we find that fine-tuning BERT would hurt the Semi-CRF model performance. Therefore, we freeze the BERT parameters for Semi-CRF here.
4 https://github.com/google-research/bert

Main Results
Table 2 shows the main results on the test sets of the FN1.5 and FN1.7 datasets, respectively, where our end-to-end model is compared with the four strong pipeline methods mentioned in Section 4.1. We can see that the end-to-end joint model leads to significantly better performance (p-value below $10^{-5}$ by pair-wise t-test) as a whole on both datasets. Concretely, we obtain average improvements of (0.57 + 0.75)/2 = 0.66 on target identification, (0.49 + 1.15)/2 = 0.82 on frame classification, and (0.66 + 0.71)/2 = 0.69 on semantic role labeling on the two datasets compared with the best results of the pipeline systems, respectively.
Besides the overall advantage of the end-to-end joint model over the pipelines, we can also find that jointly modeling two subtasks can outperform the counterpart baselines. Concretely, as shown in Table 2 [...] of joint learning. Further, by comparing our graph-based Role model with the Semi-CRF one, we can see that the Semi-CRF is better. The reason could be that the Semi-CRF model can exploit higher-order features among different frame roles, which are ignored by our simple edge building module. As our edge building considers all predicates and all roles together, incorporating such features is still highly inconvenient.
Kshirsagar et al. (2015) 63.10 -
Yang and Mitchell (2017) 65.50 -
Swayamdipta et al. (2017) 59.48 61.36
[...] 69.10 -
Marcheggiani and Titov (2020) 69.30 -
Bastianelli et al. (2020) [...]

Individual Subtask Evaluation
Previous studies commonly focus on only individual subtasks of frame semantic parsing. In order to compare with these studies, we simulate the scenarios by imposing constraints with gold-standard inputs in our joint models. In this way, we show the capability of our models on individual tasks. In particular, Bastianelli et al. (2020) report the best performance among previous studies in the literature, which is based on BERT representations. They adopt constituency syntax, which can boost the individual model performance significantly. Since our final model uses no other knowledge except BERT, we report their model performance both with syntax (denoted as w syntax) and without syntax (denoted as wo syntax) for careful comparisons.

Target Identification
We show the performance of previous studies on target identification in Table 3, and also report the results of three related models derived from this work. First, by comparing our three models (i.e., predicate only, predicate with frame, and the full graph parsing), the results show that both frame classification and semantic role labeling can help target identification. Second, we can see that our final model achieves the best performance compared with previous work.
Frame Classification Table 4 shows the results of the individual frame classification task, where all systems assume gold-standard predicates as inputs. Similar to target identification, we achieve better performance than all previous studies. Peng et al. (2018) did not use BERT, but they used extra datasets from FrameNet (exemplar sentences) and semantic dependency parsing, which can also benefit the task greatly. As for the comparison between our two implemented models, Frame alone and our final joint model, the results show that semantic role labeling can benefit frame classification, which is reasonable.
Semantic Role Labeling Table 5 shows the results of various models on the semantic role labeling task. By constraining the outputs to gold-standard predicates and frames, our model degenerates to a normal semantic role labeling model. We also give the result of using Semi-CRF. As shown, our final semantic role labeling model is highly competitive in comparison with previous studies, except the model of Bastianelli et al. (2020) with syntax. The exception is expected, since syntax has been demonstrated to be highly effective for SRL before (Peng et al., 2018; Bastianelli et al., 2020). In addition, the Semi-CRF model is better than our method, which is consistent with the results in Table 2.

Discussion
In this subsection, we conduct detailed experimental analyses to better understand our graph-based methods. Note that unless otherwise specified, the analyses are based on the FN1.7 dataset, which has a larger scale of annotations to explore.
Effectiveness on recognizing different types of predicates For frame-semantic parsing, extracting correct frame-evoking predicates is the first step, which influences the later subtasks directly. Here we perform a fine-grained analysis of predicate identification, splitting the predicates into three categories, including single-word predicates (Single) and multi-word predicates (Multi). As shown in Figure 3, our joint model achieves consistent improvements over the pipeline models for all kinds of predicates, indicating that the information from the frame and the frame-specific roles is beneficial for target identification. In addition, the multi-word predicates are more difficult than the single-word predicates, leading to significant decreases as a whole.
Performance by the role length Frame-specific roles are the core structures that frame-semantic parsing intends to obtain. Role length naturally affects performance, and longer roles are much more difficult. Here we bucket the roles into seven categories and report the F1-score of our proposed methods on them. Figure 4 shows the results. We find that the overall curve declines as the length increases, which is consistent with our intuition, and our graph-based end-to-end model is better than the pipeline methods at all lengths.
Performances of node, frame and edge Our graph-based model builds nodes, determines node attributes (frames), and builds edges sequentially, which is different from the standard pipelines based on target identification, frame classification and semantic role labeling. Thus, it is interesting to see the performance based on node building, frame classification and edge building, respectively. Table 6 shows the results, where the joint model as well as four pipeline models are included. As shown, the full joint model is better than the partial joint models, and the full pipeline model gives the worst results.
Ignoring discontinuous predicates Although in both the FN1.5 and FN1.7 datasets discontinuous predicates are significantly fewer than others, we keep them in this work for a more comprehensive study to demonstrate that our model can process them as well. Here we also add the results which ignore the discontinuous predicates (i.e., removing the predicate-predicate edges) to facilitate future studies. As shown in Table 7, our joint model performs better than the pipeline methods, which is consistent with the main results.
Table 8 compares the computational efficiency of the strong Semi-CRF baseline and our joint model for the semantic role labeling task, which is also an essential measurement of the proposed approach. Experimental results are all obtained by running models on a single 2080ti GPU. We observe that our model is almost ten times faster than Semi-CRF. Even though the Semi-CRF implementation 7 uses dynamic programming to optimize the time complexity, it still needs to iterate over the segments of each sentence in the batch one by one, which might not take advantage of the GPU's parallel capabilities to accelerate the process. In contrast, our model as a whole adopts batch-based learning, which enables more efficient inference.
7 https://github.com/swabhs/scaffolding
Frame classification without dictionary Following Swayamdipta et al. (2017), we also adopt the Lexical Unit (LU) dictionary in our model empirically. However, according to Punyakanok et al. (2008), sometimes the dictionary might be quite limited. Therefore, we offer one example in Table 9 to illustrate the capability of our model for frames not in the dictionary. As shown, our model can predict the appropriate frame outside the dictionary as well and might additionally enrich the gold-standard annotations (i.e., the blue texts which do not appear in the Ground Truth).
Table 9: An example of frame suggestion beyond the scope of the predefined LU lexicon, where the blue indicates the suggested frame outside the dictionary, and [...] represent whether inference uses the dictionary, respectively.

Conclusion
In this paper, we proposed a novel graph-based model to address the end-to-end frame semantic parsing task. The full frame semantic parsing result of one sentence is organized as a graph, and then we suggest an end-to-end neural model for the graph building. Our model first encodes the input sentence for span representations with BERT, and then constructs the graph nodes and edges incrementally. To demonstrate the effectiveness of our method, we derived several pipeline methods and used them to conduct the experiments for comparisons. Experimental results showed that our graph-based model achieved significantly better performance than various pipeline methods. In addition, in order to compare our models with previous studies in the literature, we conducted experiments in the scenarios of the individual subtasks. The results showed that our proposed models are highly competitive.