A Double-Graph Based Framework for Frame Semantic Parsing

Frame semantic parsing is a fundamental NLP task that consists of three subtasks: frame identification, argument identification and role classification. Most previous studies tend to neglect relations between different subtasks and arguments, and pay little attention to the ontological frame knowledge defined in FrameNet. In this paper, we propose KID, a Knowledge-guided Incremental semantic parser with Double-graph. We first introduce the Frame Knowledge Graph (FKG), a heterogeneous graph containing both frames and FEs (Frame Elements) built on frame knowledge, from which we derive knowledge-enhanced representations for frames and FEs. We also propose the Frame Semantic Graph (FSG) to represent the frame semantic structures extracted from text. In this way, frame semantic parsing becomes an incremental graph construction problem, which strengthens interactions between subtasks and relations between arguments. Our experiments show that KID outperforms the previous state-of-the-art method by up to 1.7 F1 points on two FrameNet datasets. Our code is available at https://github.com/PKUnlp-icler/KID.


Introduction
The frame semantic parsing task (Gildea and Jurafsky, 2002; Baker et al., 2007) aims to extract frame semantic structures from sentences based on the lexical resource FrameNet (Baker et al., 1998). As shown in Figure 1, given a target in the sentence, frame semantic parsing consists of three subtasks: frame identification, argument identification and role classification. Frame semantic parsing can also contribute to downstream NLP tasks such as machine reading comprehension (Guo et al., 2020), relation extraction and dialogue generation (Gupta et al., 2021).

Figure 1: Given the target receive in this sentence, frame identification identifies the frame Receiving evoked by it; argument identification finds the arguments (He, the book, ...) of this target; role classification assigns frame elements (Recipient, Theme, ...) as semantic roles to these arguments.
FrameNet is an English lexical database that defines more than one thousand hierarchically related frames to represent situations, objects or events, nearly 10 thousand FEs (Frame Elements) as frame-specific semantic roles, and more than 100,000 annotated exemplar sentences. In addition, FrameNet defines ontological frame knowledge for each frame, such as frame semantic relations, FE mappings and frame/FE definitions. This frame knowledge plays an important role in frame semantic parsing. Most previous approaches (Kshirsagar et al., 2015; Yang and Mitchell, 2017; Peng et al., 2018) only use exemplar sentences and ignore the ontological frame knowledge. Recent studies (Jiang and Riloff, 2021; Su et al., 2021) introduce frame semantic relations and frame definitions into the frame identification subtask. Different from previous work, we construct a heterogeneous graph named Frame Knowledge Graph (FKG) based on frame knowledge to model multiple semantic relations between frames and frames, frames and FEs, as well as FEs and FEs. Furthermore, we apply FKG to all subtasks of frame semantic parsing, which fully injects frame knowledge into frame semantic parsing. The knowledge-enhanced representations of frames and FEs are learned in a unified vector space, which also strengthens interactions between frame identification and the other subtasks.
Figure 2: An example of how frame knowledge contributes to frame semantic parsing. The frame semantic relations and FE mappings guide inter-frame reasoning (from the left sentence to the right); the FE definitions help with intra-frame reasoning (Theme to Role and Role to Donor).

Most previous systems neglect interactions between subtasks: they either focus on one or two subtasks (Hermann et al., 2014; FitzGerald et al., 2015; Marcheggiani and Titov, 2020) of frame semantic parsing or treat all subtasks independently (Peng et al., 2018). Furthermore, in argument identification and role classification, previous approaches process each argument separately with a sequence labeling strategy (Yang and Mitchell, 2017; Bastianelli et al., 2020) or span-based graphical models (Peng et al., 2018). In this paper, we propose the Frame Semantic Graph (FSG) to represent frame semantic structures, and treat frame semantic parsing as a process that constructs this graph incrementally. With the graph structure, historical parsing decisions can guide the current decision in argument identification and role classification, which highlights interactions between subtasks and arguments. Based on the two graphs above, we propose our framework KID (Knowledge-guided Incremental semantic parser with Double-graph). FKG provides a static knowledge background for encoding frames and FEs, while FSG represents the dynamic parsing results and highlights relations between arguments.
Overall, our contributions are as follows: • We build FKG based on the ontological frame knowledge in FrameNet. FKG equips frame semantic parsing with structured frame knowledge, from which we derive knowledge-enhanced representations of frames and FEs.
• We propose FSG to represent frame semantic structures. We treat frame semantic parsing as a process that constructs this graph incrementally, focusing on target-argument and argument-argument relations.
We evaluate the performance of KID on two FrameNet datasets, FN 1.5 and FN 1.7. The results show that KID achieves state-of-the-art performance on these datasets, improving the F1-score by up to 1.7 points. Our extensive experiments also verify the effectiveness of the two graphs.

Ontological Frame Knowledge
Frame semantics relates linguistic semantics to encyclopedic knowledge and advocates that one cannot understand the meaning of a word without the essential frame knowledge related to it (Fillmore and Baker, 2001). The frame knowledge of a frame contains frame/FE definitions, frame semantic relations and FE mappings. FrameNet defines 8 kinds of frame semantic relations, such as Inheritance, Perspective_on and Using; for any two related frames, FrameNet defines FE mappings between their FEs. For example, the frame Receiving inherits from Getting, and the FE Donor of Receiving is mapped to the FE Source of Getting. Each frame or FE has its own definition, which may mention other FEs.
As illustrated in Figure 2, we propose two ways of reasoning about frame semantic parsing: inter-frame reasoning and intra-frame reasoning. The frame knowledge described above can guide both. The frame semantic relation between Receiving and Getting and the FE mappings associated with it allow us to learn from the left sentence when parsing the right one, because similar argument spans in the two sentences will have related FEs as their roles. The FE definitions reflect dependencies between arguments: the definition of Role in the frame Receiving mentions Theme and Donor, which reflects dependencies between the argument the book and the argument as a gift.
Given a sentence S = w_0, ..., w_{n−1} with a target span t in S, frame semantic parsing aims to extract the frame semantic structure of t. Suppose there are k arguments of t in S: a_0, ..., a_{k−1}. The subtasks can be formulated as follows:
• Frame identification: finding the frame f ∈ F evoked by target t, where F denotes the set of all frames in FrameNet.
• Argument identification: finding the boundaries i_τ^s and i_τ^e for each argument a_τ = w_{i_τ^s}, ..., w_{i_τ^e}.
• Role classification: assigning an FE r_τ ∈ R_f to each a_τ, where R_f denotes the set of all FEs of frame f.

Method
KID encodes all frames and FEs into knowledge-enhanced representations via the frame knowledge graph encoder (section 4.1). For a sentence with a target, contextual representations of tokens are derived from the sentence encoder (section 4.2). Frame semantic parsing is regarded as a process that builds FSG incrementally, from an initial target node to the complete FSG. Frame identification finds the frame evoked by the target and combines the target with its frame into the initial node of FSG (section 4.3.1). Argument identification (section 4.3.2) and role classification (section 4.3.3) for each argument are based on the current snapshot of the partial FSG, taking all historical decisions into account. Section 4.4 describes how the frame semantic graph decoder encodes the partial FSG into its representation and how it expands FSG incrementally, as shown in Figure 4.

Frame Knowledge Graph Encoder
FKG is an undirected multi-relational heterogeneous graph; Figure 3 shows a subgraph of FKG. Its nodes contain both frames and FEs, and there are four kinds of relations in FKG: frame-FE, frame-frame, inter-frame FE-FE and intra-frame FE-FE. We extract these relations from frame knowledge as follows:
• Frame-FE: we connect a frame with its FEs so that representations of frames and FEs are learned in a unified vector space, strengthening interactions between frame identification and the other subtasks.
• Frame-frame and inter-frame FE-FE: these relations correspond to frame semantic relations and FE mappings respectively (here we ignore the types of frame semantic relations). Both guide the inter-frame reasoning of Figure 2.
• Intra-frame FE-FE: if the definition of an FE mentions another FE in the same frame, the two FEs are connected. This relation helps with intra-frame reasoning and strengthens interactions between arguments.
The frame knowledge graph encoder derives knowledge-enhanced representations of the nodes in FKG via an RGCN (Schlichtkrull et al., 2018) module. We use F to denote all frames in FrameNet and R_f to denote all FEs of frame f; R = ∪_{f∈F} R_f denotes all FEs in FrameNet. Let 0, ..., |F| − 1 index all frames and |F|, ..., |F| + |R| − 1 index all FEs. Moreover, we introduce a special dummy node with index |F| + |R| into FKG. The vectors y_0, ..., y_M ∈ R^{d_n} denote the representations of all nodes in FKG, where M = |F| + |R|.
For each node i, we take a randomly initialized embedding y_i^{(0)} as input. The RGCN module models the four kinds of relations: frame-FE, intra-frame FE-FE, frame-frame and inter-frame FE-FE. To better model inter-frame relations between FEs, we also fuse name information into the representations of FEs: FEs with the same name share the same embeddings, i.e., for i, j ≥ |F|, y_i^{(0)} = y_j^{(0)} if node i has the same name as node j.

Figure 4: Based on the representation g_τ of the partial graph G_τ, the frame semantic graph decoder identifies the new argument as a gift with pointer networks and labels it with the FE Role. G_τ is then updated to G_{τ+1} with (as a gift, Role).
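The shared-embedding strategy for FE names can be sketched as follows. This is a NumPy toy with a made-up four-node FE inventory; the embedding size is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy FE inventory: (frame, FE name). FEs with the same name share one
# embedding row, implementing the shared-embedding strategy described above.
fes = [("Receiving", "Recipient"), ("Receiving", "Theme"),
       ("Getting", "Recipient"), ("Getting", "Source")]
names = sorted({name for _, name in fes})       # distinct FE names
name2idx = {n: i for i, n in enumerate(names)}

d_n = 8                                         # node embedding size (illustrative)
name_emb = rng.normal(size=(len(names), d_n))   # one row per distinct FE name

# Input representation y_i^(0) for each FE node is its shared name embedding.
idx = [name2idx[name] for _, name in fes]
y0 = name_emb[idx]                              # shape (4, d_n)
```

FE nodes 0 and 2 are both named Recipient, so their input rows are identical even though they belong to different frames.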
Furthermore, we use boundary information (Wang and Chang, 2016; Cross and Huang, 2016; Ouchi et al., 2018) to represent a span s = w_i, ..., w_j based on token representations, because we need to embed spans into the vector space of FKG in frame identification and role classification:

Q(i, j) = FFN(h_i ⊕ h_j)

The dimension of Q(i, j) is d_n, ⊕ denotes the concatenation operation, and FFN denotes a Feed Forward Network.
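A minimal sketch of this boundary-based span representation, assuming a single-layer FFN over the concatenated boundary tokens (the exact depth and activation of the FFN are our assumption, not stated in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_n, n = 16, 8, 5            # hidden size, FKG node size, toy sentence length
h = rng.normal(size=(n, d_h))     # token representations h_0, ..., h_{n-1}

# One-layer FFN with ReLU (illustrative parameterization).
W = rng.normal(size=(2 * d_h, d_n))
b = np.zeros(d_n)

def span_repr(i: int, j: int) -> np.ndarray:
    """Q(i, j): embed span w_i..w_j from its boundary tokens, FFN(h_i ⊕ h_j)."""
    return np.maximum(np.concatenate([h[i], h[j]]) @ W + b, 0.0)

q = span_repr(1, 3)               # span representation, lives in R^{d_n}
```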

Frame Identification
A frame f ∈ F is identified based on the target t, the token representations h_0, ..., h_{n−1} and the frame representations y_0, ..., y_{|F|−1} with a scoring module. The target t = w_{i_t^s}, ..., w_{i_t^e} is embedded into the vector space of all frames as γ_t ∈ R^{d_n}. We then calculate normalized dot-product similarities between γ_t and all frames:

P(f | t) = softmax(Y_F^T γ_t), where Y_F = (y_0, ..., y_{|F|−1}) ∈ R^{d_n×|F|}.
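The scoring step can be sketched as a softmax over dot products with the frame vectors (toy sizes; `softmax` written out for clarity):

```python
import numpy as np

rng = np.random.default_rng(0)
d_n, num_frames = 8, 6
gamma_t = rng.normal(size=d_n)              # target t embedded into the frame space
Y_F = rng.normal(size=(d_n, num_frames))    # columns y_0..y_{|F|-1}: frame vectors

def softmax(x):
    e = np.exp(x - x.max())                 # shift for numerical stability
    return e / e.sum()

# Normalized dot-product similarity over all frames gives P(f | t).
p_frame = softmax(Y_F.T @ gamma_t)
pred_frame = int(p_frame.argmax())
```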

Argument Identification
Based on g_τ, the representation of the current snapshot G_τ of FSG, we need to find an argument a_τ = w_{i_τ^s}, ..., w_{i_τ^e}. We use pointer networks (Vinyals et al., 2015) to identify its start and end positions i_τ^s and i_τ^e separately via an attention mechanism, which is more efficient than a traditional span-based model (Chen et al., 2021). Take i_τ^s as an example: H = (h_0, ..., h_{n−1}) represents the output of the sentence encoder, and the query vector ρ_τ^s ∈ R^{d_h} is used to find the start position of the argument span a_τ.
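The pointer-network step for one boundary can be sketched as attention over token positions. How the query vector is derived from g_τ is omitted here and is our assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, n = 16, 7
H = rng.normal(size=(d_h, n))    # sentence encoder outputs, one column per token
rho_s = rng.normal(size=d_h)     # query vector for the start position (from g_tau)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Pointer network: attention scores over positions -> distribution over i_tau^s.
p_start = softmax(H.T @ rho_s)
i_s = int(p_start.argmax())      # predicted start position of a_tau
```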

Role Classification
Based on g_τ and a_τ, we embed a_τ into the vector space of FEs as γ_{a_τ} ∈ R^{d_n}. Similar to frame identification, we calculate normalized dot-product similarities between γ_{a_τ} and all FEs Y_R = (y_{|F|}, ..., y_{|F|+|R|}) ∈ R^{d_n×(|R|+1)} to get the conditional probability distribution of r_τ given a_τ and G_τ.

Frame Semantic Graph Decoder
We propose FSG to represent the frame semantic structure of t in the sentence S, and we treat frame semantic parsing as a process that constructs FSG incrementally. Intermediate results are partial FSGs representing all historical decisions, which highlights interactions between arguments. Suppose there are k arguments of target t, a_0, ..., a_{k−1}, with their roles r_0, ..., r_{k−1}. The τ-th snapshot G_τ of FSG contains τ + 1 nodes: one target node (t, f) and τ argument nodes (if they exist) (a_0, r_0), ..., (a_{τ−1}, r_{τ−1}). The target node is connected with all argument nodes. The indices of nodes in G_τ depend on the order in which they are added to the graph: 0 denotes the target node and 1, ..., τ denote (a_0, r_0), ..., (a_{τ−1}, r_{τ−1}). We encode G_τ into its representation g_τ, where i_f and i_{r_j} denote the indices of f and r_j in FKG, and π_{a_j} = Q(i_j^s, i_j^e). A GCN module encodes the partial FSG.
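Encoding one star-shaped snapshot with a single GCN layer can be sketched as follows. Reading g_τ off the target node's state is our assumption about the readout; the fusion of frame/FE and span representations into node inputs is abstracted away:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, d_n = 3, 8
X = rng.normal(size=(tau + 1, d_n))  # node 0: target (t, f); nodes 1..tau: arguments

# Star-shaped partial FSG: the target node is connected to every argument node.
A = np.zeros((tau + 1, tau + 1))
A[0, 1:] = A[1:, 0] = 1.0
A += np.eye(tau + 1)                 # add self-loops
D_inv = np.diag(1.0 / A.sum(axis=1))

W = rng.normal(size=(d_n, d_n))
H1 = np.maximum(D_inv @ A @ X @ W, 0.0)  # one GCN layer over the snapshot
g_tau = H1[0]                            # read out the target node's state
```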
Based on the representation g_τ of each snapshot G_τ, KID predicts the boundary positions of argument a_τ and assigns an FE r_τ as its semantic role (sections 4.3.2 and 4.3.3). G_τ is updated to G_{τ+1} with the new node (a_τ, r_τ) until r_τ is the special dummy node in FKG. Figure 4 shows how a new node is found and added to the FSG.

Training
We train KID on all subtasks jointly, since representations of frames and FEs are learned in a unified vector space. The loss function L is the joint negative log-likelihood of the gold frame f_gold and the gold labels i_τ^s, i_τ^e, r_τ^gold of each argument a_τ. The final role r_k^gold is "Dummy", indicating the end of parsing. We force our model to identify arguments in left-to-right order, i.e., a_0 is the leftmost argument in S. We use the gold frame in the initial node of FSG, G_0 = (t, f_gold), while other nodes are predicted autoregressively.
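The joint objective can be sketched as a sum of negative log-likelihood terms, one per subtask decision. This is a hedged reconstruction; the paper's exact loss may group or weight the terms differently:

```python
import numpy as np

def kid_loss(p_frame, f_gold, step_probs, gold_steps):
    """Joint negative log-likelihood: one frame term plus, for every decoding
    step, start-boundary, end-boundary and role terms (the last step's role
    would be the Dummy node)."""
    loss = -np.log(p_frame[f_gold])
    for (p_s, p_e, p_r), (i_s, i_e, r) in zip(step_probs, gold_steps):
        loss -= np.log(p_s[i_s]) + np.log(p_e[i_e]) + np.log(p_r[r])
    return loss

# Toy distributions: one frame choice and a single argument step.
p_frame = np.array([0.7, 0.3])
step = (np.array([0.5, 0.5]), np.array([0.2, 0.8]), np.array([0.9, 0.1]))
loss = kid_loss(p_frame, 0, [step], [(0, 1, 0)])
```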

Inference
KID predicts the frame and all arguments with their roles sequentially. We use the probabilities above with some constraints:
1. Lexicon filtering: for a target t, we use its lemma to find a subset of frames F_t ⊂ F, which reduces the search space.
2. Similarly, we take R_f instead of R as the set of candidate FEs.
3. In argument identification, we mask spans that have already been selected as arguments, and i_τ^e must be no less than i_τ^s.
We also train our model with multiple runs and report a statistical analysis with significance testing in appendix C.
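Constraint 3 (masking already-selected spans) can be sketched as follows, simplified here to the start-position distribution only:

```python
import numpy as np

def mask_taken_spans(p_start, taken):
    """Zero out positions inside spans already selected as arguments, then
    renormalize the remaining probability mass."""
    p = p_start.copy()
    for i_s, i_e in taken:
        p[i_s : i_e + 1] = 0.0
    return p / p.sum()

# Toy uniform distribution over 6 token positions; span (1, 2) already taken.
p = np.full(6, 1.0 / 6)
p_masked = mask_taken_spans(p, [(1, 2)])
```

After masking, the remaining four positions each carry probability 0.25.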
Following Peng et al. (2018) and Chen et al. (2021), we include exemplar instances as training instances. As Kshirsagar et al. (2015) note, there exists a domain gap between exemplar instances and the original training instances, so we follow Chen et al. (2021) in using exemplar instances for pre-training and then further training our model on the original training instances. Table 1 shows the numbers of instances in the two datasets.

Empirical Results
We compare KID with previous models (see appendix C) on FN 1.5 and FN 1.7. We focus on three metrics: frame acc, arg F1 and full structure F1. Full structure F1 measures the performance of extracting full frame semantic structures from text, frame acc denotes the accuracy of frame identification, and arg F1 evaluates the results of argument identification and role classification given gold frames. All metrics are evaluated on the test set.

Table 2 shows the results on FN 1.5. For a fair comparison, we divide models into three parts: the first part does not use exemplar instances as training data; the second part uses exemplar instances without any pretrained language model; the third part uses pretrained language models. KID (GloVe) uses GloVe (Pennington et al., 2014) word embeddings, and KID (BERT) fine-tunes the pretrained language model BERT (Devlin et al., 2019) to encode word representations. KID achieves state-of-the-art results on these metrics under almost all circumstances. Trained with multiple runs, KID (GloVe + exemplar) and KID (BERT) outperform the previous state-of-the-art models by 1.4 and 1.3 full structure F1 points on average, respectively. One exception is that our model with BERT does not outperform Su et al. (2021) on frame identification accuracy; we find that the numbers of train, validation and test instances they report are slightly smaller than ours. Results on FN 1.7 and the statistical analysis of multiple runs are listed in appendix C.
It is worth noting that KID achieves much higher recall than other models. We attribute this to the incremental strategy of building FSG: by constructing FSG incrementally, KID can capture relations between arguments and identify arguments that other models find hard to recover.

Ablation Study
To prove the effectiveness of the double-graph architecture, we conduct further experiments with KID on FN 1.5. In the w/o FSG setting, an LSTM takes the sequence of arguments and roles that have already been identified as input to predict the next argument. FSG performs better than the LSTM because it captures target-argument and argument-argument relations and can model long-distance dependencies. w/o FKG directly uses the input vectors of the frame knowledge graph encoder, and the results show that knowledge-enhanced representations are better than randomly initialized embeddings. We also test the influence of the double-graph structure with pretrained language models; the results show that the double-graph structure remains effective even with pretrained language models. FKG is a multi-relational heterogeneous graph, and the ablation study on its structure is shown in Table 4. In addition, we evaluate the performance of FI w/o FKG, which identifies frames with a simple linear classification layer instead of FKG, and the results demonstrate that FKG strengthens interactions between frame identification and role classification.
In addition, we explore the effectiveness of the name information of FEs. Whether name information is used in previous work is unclear; some BIO-based approaches like Marcheggiani and Titov (2020) are likely to use it by treating FEs with the same name as the same label in role classification. To the best of our knowledge, we are the first to study the effectiveness of name information. As shown in Table 5, the names of FEs provide rich information for frame semantic parsing, and the shared embedding strategy makes good use of this information. A further ablation study of FKG conducted without name information shows that performance drops 0.9 points if we also remove FKG, demonstrating that knowledge-enhanced representations are important regardless of whether embeddings are shared among FEs with the same name.

Transfer learning ability of FKG
As discussed in Figure 2, if frame B is related to frame A, a sentence with frame A can contribute to parsing another sentence with frame B via inter-frame reasoning. The frame-frame and inter-frame FE-FE relations of FKG can guide KID to learn from other frames. To confirm that FKG enables transfer learning, we design zero (few)-shot learning experiments on FN 1.7. The target word get can evoke multiple frames in FrameNet, and we choose instances containing the target get with three frames (Arriving, Getting and Transition_to_state) as test instances. We remove all instances with target get from the train and development sets, and selectively add few (or zero) instances containing other targets with these three frames into the train and development sets. We then compare the performance of KID with KID w/o FKG under zero-shot and few-shot circumstances. If FKG enables transfer learning, KID with FKG can learn from related frames like Receiving, so its performance should be less affected by label sparsity. Table 6 shows the results. K = 0 indicates zero-shot learning while K ∈ {4, 16, 32} indicates few-shot learning. KID without FKG performs much worse in zero-shot learning. As the number of instances seen in training grows, the performance of KID with FKG increases steadily while the performance of KID without FKG increases rapidly. These results verify our assumption that, even with few training instances, FKG can guide inter-frame reasoning through its structure and allow the model to learn from other, seen frames.

Related Work
Frame semantic parsing has attracted wide attention since it was released at SemEval 2007 (Baker et al., 2007). The task is to extract frame structures defined in FrameNet (Baker et al., 1998) from text. Since then, a large number of systems have been applied to this task, ranging from traditional machine learning classifiers (Johansson and Nugues, 2007; Das et al., 2010) to neural models such as recurrent neural networks (Yang and Mitchell, 2017; Swayamdipta et al., 2017) and graph neural networks (Marcheggiani and Titov, 2020; Bastianelli et al., 2020).
Many previous systems neglect interactions between subtasks and relations between arguments: they either focus on one or two subtasks (Hermann et al., 2014; FitzGerald et al., 2015; Marcheggiani and Titov, 2020) of frame semantic parsing or treat all subtasks independently (Peng et al., 2018). Some approaches propose an efficient global graphical model that enumerates all possible argument spans and treats role assignment as an Integer Linear Programming problem. Yang and Mitchell (2017) integrate these two methods in a joint model. Only a few approaches, like Chen et al. (2021), model interactions between subtasks, using an encoder-decoder architecture to predict arguments and roles sequentially. However, the sequence modeling of Chen et al. (2021) does not consider structural information and is not good at capturing long-distance dependencies. We use graph modeling to enhance structural information and strengthen the interactions between target and arguments, and between arguments themselves.
Only a few systems utilize the linguistic knowledge in FrameNet. Kshirsagar et al. (2015) use FE mappings to share information among FEs. In frame identification, Jiang and Riloff (2021) encode definitions of frames, and Su et al. (2021) use both frame definitions and frame semantic relations. However, they do not utilize ontological frame knowledge in all subtasks, whereas we construct a heterogeneous graph containing both frames and FEs. Besides, our model does not need extra encoders for definitions, which reduces the number of parameters. Some systems also treat constituency parsing or other semantic parsing tasks like AMR as a graph construction problem. Yang and Deng (2020) use a GCN to encode the intermediate constituency tree when generating a new action on the tree. Cai and Lam (2020) construct AMR graphs with the Transformer (Vaswani et al., 2017) architecture.

Conclusion
In this paper, we incorporate knowledge into frame semantic parsing by constructing the Frame Knowledge Graph. FKG provides knowledge-enhanced representations of frames and FEs and can guide intra-frame and inter-frame reasoning. We also propose the Frame Semantic Graph to represent frame semantic structures, and regard frame semantic parsing as an incremental graph construction problem. The process of constructing FSG is structure-aware and can utilize relations between arguments. Our framework, the Knowledge-guided Incremental semantic parser with Double-graph (KID), achieves state-of-the-art results on FrameNet benchmarks. However, how to better utilize linguistic knowledge remains an open problem. Future work can focus on better modeling of ontological frame knowledge, which will be useful for frame semantic parsing and for transfer learning within it.

A.1 Graph Convolutional Network
Graph convolution was introduced in Kipf and Welling (2016). A GCN layer is defined as follows:

H^{(l+1)} = σ(D̃^{−1/2} Ã D̃^{−1/2} H^{(l)} W^{(l)})

where H^{(l)} denotes the hidden representations of nodes in the l-th layer, σ is a non-linear activation function (e.g. ReLU) and W^{(l)} is the weight matrix. D̃ and Ã are respectively the degree and adjacency matrices of the graph G (with self-loops). From a node-level perspective, a GCN layer can also be formalized as follows:

h_i^{(l+1)} = σ(Σ_{j∈N(i)} (1/c_{ji}) W^{(l)} h_j^{(l)})

where N(i) is the set of neighbors of node i, and c_{ji} = √|N(j)| √|N(i)|. By stacking L GCN layers, we get the final GCN module GCN(H^{(0)}, G).
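The layer above, sketched in NumPy with a two-node toy graph:

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN layer: H' = ReLU(D^{-1/2} A~ D^{-1/2} H W), with A~ = A + I."""
    A_tilde = A + np.eye(A.shape[0])            # add self-loops
    d_inv_half = np.diag(A_tilde.sum(axis=1) ** -0.5)
    return np.maximum(d_inv_half @ A_tilde @ d_inv_half @ H @ W, 0.0)

rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0], [1.0, 0.0]])          # two connected nodes
H1 = gcn_layer(rng.normal(size=(2, 4)), A, rng.normal(size=(4, 4)))
```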

A.2 Relational Graph Convolutional Network
Type information of graph edges is ignored in GCN; RGCN (Schlichtkrull et al., 2018) was proposed to model relational data, which we use to model the multi-relational graph FKG. Different edge types use different weights, and only edges of the same relation type r share the same projection weight W_r. From a node's view:

h_i^{(l+1)} = σ(W_0^{(l)} h_i^{(l)} + Σ_{r∈R} Σ_{j∈N_i^r} (1/c_{i,r}) W_r^{(l)} h_j^{(l)})

where N_i^r denotes the set of neighbor indices of node i under relation r ∈ R and c_{i,r} is a normalization constant, i.e. |N_i^r|. In KID, we use tanh as the activation function of the RGCN for normalization, because we need to calculate normalized dot-product similarities between frames/FEs and the target/arguments.
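A NumPy sketch of one such relational layer with two toy relations. The self-loop weight follows Schlichtkrull et al.'s formulation, and tanh keeps outputs bounded as described above:

```python
import numpy as np

def rgcn_layer(H, adj_per_rel, W_rel, W_self):
    """One RGCN layer: relation-specific, degree-normalized (c_{i,r} = |N_i^r|)
    messages plus a self-loop term, followed by tanh as in KID."""
    out = H @ W_self
    for A_r, W_r in zip(adj_per_rel, W_rel):
        c = np.maximum(A_r.sum(axis=1, keepdims=True), 1.0)  # avoid div by zero
        out = out + (A_r @ (H @ W_r)) / c
    return np.tanh(out)

rng = np.random.default_rng(0)
n, d = 3, 4
A1 = np.array([[0., 1., 0.], [1., 0., 0.], [0., 0., 0.]])   # e.g. frame-FE edges
A2 = np.array([[0., 0., 1.], [0., 0., 0.], [1., 0., 0.]])   # e.g. FE-FE edges
H1 = rgcn_layer(rng.normal(size=(n, d)), [A1, A2],
                [rng.normal(size=(d, d)), rng.normal(size=(d, d))],
                rng.normal(size=(d, d)))
```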

A.3 Encoding Dependency Tree
We follow previous studies (Marcheggiani and Titov, 2020; Bastianelli et al., 2020) in using syntactic structures, i.e. the dependency tree T of S, in KID, because syntactic structure has proved beneficial to semantic parsing. We use Stanza (Qi et al., 2020), an open-source Python NLP toolkit, to parse the dependency structure of each instance, and depGCN (Marcheggiani and Titov, 2017) to encode it. We simplify depGCN by ignoring the directions and labels of edges in the dependency tree: if token i is the head or dependent of token j, then A^T_{ij} = A^T_{ji} = 1 in the adjacency matrix A^T of T.
In addition, if we use BERT as the encoder, tokens are at the sub-word level and the adjacency matrix differs slightly. Specifically, if token i is a sub-word of word u, token j is a sub-word of word v, and u is the head or dependent of v, then A^T_{ij} = A^T_{ji} = 1 in the adjacency matrix A^T.
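Constructing this sub-word adjacency matrix can be sketched as follows; the word indices and word-to-sub-word mapping are illustrative:

```python
import numpy as np

def subword_adjacency(word2sub, dep_pairs, n_sub):
    """A^T over sub-word tokens: A_ij = A_ji = 1 whenever sub-word i belongs to
    word u, sub-word j to word v, and u is head or dependent of v (direction
    and labels ignored, as in the simplified depGCN above)."""
    A = np.zeros((n_sub, n_sub))
    for u, v in dep_pairs:
        for i in word2sub[u]:
            for j in word2sub[v]:
                A[i, j] = A[j, i] = 1.0
    return A

# Word 1 is tokenized into sub-words 1 and 2; word 0 is linked to word 1.
word2sub = {0: [0], 1: [1, 2]}
A = subword_adjacency(word2sub, [(0, 1)], n_sub=3)
```

Note that sub-words of the same word are not connected to each other by this rule, only to sub-words of dependency-linked words.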

B Hyper-parameter Setting
For the replicability of our work, we list the hyper-parameter settings of KID (GloVe) and KID (BERT) in Tables 7 and 8. We use the development set to manually tune the hyper-parameters based on full structure F1; the finally selected values are in bold. The token embeddings in KID (GloVe) are the same as in Chen et al. (2021), including word, lemma and POS tag embeddings, with a binary type embedding to distinguish whether a token is a target or not.

Su et al. (2021): a BERT-based model for frame identification using both frame definitions and frame semantic relations.
C.2 Empirical Results on FN 1.7

Tables 9, 10 and 11 list our results alongside the comparison models. KID outperforms the previous state-of-the-art except Su et al. (2021). FN 1.7 is the latest extension of FN 1.5, containing more fine-grained frames and FEs. However, only a few models report results on FN 1.7, and we hope future work on frame semantic parsing will focus more on it.

C.3 Time Costs of FKG
FKG is built over the full FrameNet and contains more than 10,000 nodes, and the intra-frame and inter-frame relations make the graph larger still. Since we need to encode the full FKG when parsing a single sentence, it is necessary to explore the time cost of doing so. The results are shown in Table 12: encoding FKG takes approximately 20% of the whole runtime and may slightly hurt the efficiency of our models. However, at inference time the representations of the nodes in FKG are fixed, so we can load them offline to reduce inference time.
C.4 Statistical Analysis of KID on FN 1.5

To evaluate the robustness of our model, we train KID with five random seeds. The average performance with standard deviation and the results of significance testing are listed in Table 13. The significance testing shows whether our model significantly outperforms the previous state-of-the-art; we do not conduct significance testing for KID (BERT) because it does not outperform Su et al. (2021) on frame accuracy. All p-values are less than 0.05, and some are even less than 1e-3, which demonstrates the robustness of our model.

Table 13: Statistical analysis of multiple runs on FN 1.5. We train our model with five different random seeds s_1 to s_5; the results with seed s_4 are reported in Table 2. We report both the average performance with deviation and the results of significance testing, where * denotes a p-value less than 1e-3.