A Knowledge-Guided Framework for Frame Identification

Frame Identification (FI) is a fundamental and challenging task in frame semantic parsing. The task aims to find the exact frame evoked by a target word in a given sentence. Existing work generally treats FI as a classification task, where frames are discrete labels or are represented by one-hot embeddings; the valuable knowledge about frames is thus neglected. In this paper, we propose a Knowledge-Guided Frame Identification framework (KGFI) that integrates three types of frame knowledge, including frame definitions, frame elements and frame-to-frame relations, to learn better frame representations. These representations guide KGFI to jointly map target words and frames into the same embedding space and then identify the best frame by calculating the dot-product similarity scores between the target word embedding and all of the frame embeddings. Extensive experimental results demonstrate that KGFI significantly outperforms state-of-the-art methods on two benchmark datasets.


Introduction
Frame Identification (FI) aims to find the exact frame evoked by a target word in a given sentence. A frame represents an event scenario and possesses frame elements (or semantic roles) that participate in the event (Hermann et al., 2014); frames are described in the FrameNet knowledge base (Baker et al., 1998; Ruppenhofer et al., 2016), grounded in the theory of Frame Semantics (Fillmore et al., 2002). The theory asserts that people understand the meaning of words largely by virtue of the frames they evoke. In general, many words are polysemous and can evoke different frames in different contexts.
As shown in Figure 1, the word stopped evokes the frame Activity stop and the frame Process stop respectively in two sentences. [Figure 1: annotated examples with the target word marked in bold and frame elements (semantic roles) in rounded rectangles. The target word stopped (stop.v denotes its lexical-unit form) evokes the frame Activity stop and the frame Process stop respectively in different contexts.] Here, the key to distinguishing these two frames is identifying whether the subject (The company or The fighting) of stopped is an Agent or a Process (see the frame definitions in Table 1).
It is a challenging task to distinguish the frames evoked by target words in sentences. Furthermore, FI is a key step before Frame Semantic Role Labeling (FSRL) (Das et al., 2010; Swayamdipta et al., 2017; Kalyanpur et al., 2020), which is widely used in event recognition (Liu et al., 2016), machine reading comprehension (Guo et al., 2020b,a), relation extraction, etc. Through the FI process, hundreds of role labels in FrameNet are reduced to a manageable small set (Hartmann et al., 2017), which can significantly improve the performance of FSRL models. Thus, FI is a fundamental and critical task in NLP.
FI is typically regarded as a classification task in which the class labels are frame names. In earlier studies, researchers manually constructed features and then used supervised learning methods to learn classification models (Bejan and Hathaway, 2007; Johansson and Nugues, 2007; Das et al., 2010). These methods, however, do not take the valuable semantic information about frames into consideration, and merely treat them as discrete labels.

[Table 1: the structured knowledge of the frames Activity stop and Process stop, including their LUs (cease.v, halt.v, quit.v, stop.v, ... and cease.v, halt.n, shutdown.n, stop.v, ...) and frame-to-frame relations (FRs): Activity stop inherits from Process stop, is a subframe of Activity and is inherited by Halt; Process stop inherits from Event, is a subframe of Process, is inherited by Activity stop and uses Eventive affecting.]
Recent studies of FI use distributed representations of target words and their syntactic context to construct features, and build classification models with deep neural networks (Hartmann et al., 2017; Kabbach et al., 2018). These methods usually transform frame labels into one-hot representations (Hermann et al., 2014; Täckström et al., 2015), and then learn the embeddings of target words and frames simultaneously. However, the abundant semantic information and structural knowledge of frames contained in FrameNet are still neglected.
Knowledge of frames defined by linguists, such as frame definitions, frame elements and frame-to-frame relations, can enrich frame labels with rich semantic information that can potentially guide FI models to learn more unique and distinguishable representations. Thus, in this paper, we propose a Knowledge-Guided Frame Identification framework (KGFI), which consists of a Bert-based context encoder and a frame encoder based on a specialized graph convolutional network (FrameGCN). In particular, the frame encoder incorporates multiple types of frame knowledge into the frame representations, which guide KGFI to jointly map target words and frames into the same embedding space. Instead of predicting the frame label directly, KGFI chooses the most suitable frame evoked by the target word in a given sentence by calculating the dot-product similarity scores between the target word embedding and all of the frame embeddings. In summary, our contribution is threefold:

• To the best of our knowledge, we are the first to propose a unified FI method which leverages heterogeneous frame knowledge for building rich frame representations.

† See the details in https://FN.icsi.berkeley.edu/fndrupal/
• We design a novel framework, KGFI, consisting of a Bert-based context encoder and a GCN-based frame encoder, which learns from a combination of annotated data and the FrameNet knowledge base, and maps target words and frames into the same embedding space.
• Extensive experimental results demonstrate our proposed KGFI framework outperforms the state-of-the-art models across two benchmark datasets.
FrameNet and FI Task Definition

FrameNet
FrameNet is built on the hypothesis that people understand things by performing mental operations on what they already know (Baker et al., 1998). Such knowledge, reflecting people's cognitive experience, is described as structured information packets called frames. A frame represents an event scenario and is associated with a set of semantic roles (frame elements, FEs). Lexical units (LUs) are the words capable of evoking the scenario (Kshirsagar et al., 2015). According to how central they are to a particular frame, frame elements are divided into three levels: core, peripheral and extra-thematic. Each frame has a textual definition (Def) depicting the scenario and how the roles interact in it. Frames are organized as a network with several kinds of frame-to-frame relations (FRs). Table 1 shows the structure of the frames Activity stop and Process stop in FrameNet.

FI Task Definition
Frame Identification (FI) is the task of predicting the frame evoked by a target word in a sentence. Let c = (w_0, w_1, ..., w_st, ..., w_en, ..., w_n) denote a given sentence and t = (w_st, ..., w_en) (t ⊂ c) the target word, where st and en are the start and end indices of the target word t in the sentence.
Let F = (f_1, f_2, ..., f_|F|) denote the set of all frames in FrameNet. The FI model is then a mapping function G : (c, t, st, en) → f_j, subject to f_j ∈ F.

The KGFI Framework

Table 1 illustrates the structured knowledge (Def, FEs, LUs) of two different frames and their frame-to-frame relations (FRs). We explicitly leverage this knowledge to enrich the frame embeddings with semantic information. The resulting informative frame representations serve two purposes: 1) they guide our model to learn more distinguishable embeddings of target words, and 2) they improve the FI model's generalization performance in the prediction phase. The proposed KGFI framework consists of three components: a context encoder, a frame encoder and a scoring module, as shown in Figure 2. Specifically, the context encoder represents the context-aware target word as an embedding with a Bert-based module, and the frame encoder incorporates three types of knowledge about a frame into the frame embeddings. With the guidance of the knowledge about frames, the two encoders jointly learn the embeddings of target words and frames. Finally, the scoring module calculates the similarity scores between the given target word embedding and all frame embeddings, and identifies the best frame as the one with the highest score.

Context Encoder
To get context-aware embeddings of target words, we employ Bert (Devlin et al., 2019) as our context encoder, since its architecture is a multi-layer bidirectional Transformer which can aggregate information from the context into the target word through the self-attention mechanism. Moreover, the Bert model is pre-trained on a large corpus and transfers language knowledge into the context encoder, which is very helpful for target word representation because the manually labeled training data for FI is very small.
The context encoder, which we denote E_c, takes a given sentence c containing a target word t as input. We denote the last-layer Bert representation of the target word as H_t. The context encoder can then be expressed as r_t = H_t W_c + b_c, where W_c ∈ R^{n×m} and b_c ∈ R^m are learned parameters.
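As a minimal sketch, the projection above can be written as follows; the shapes (n = 768, m = 128) follow the parameter settings reported later in the paper, and the random H_t merely stands in for the Bert output of the target word:

```python
import numpy as np

# Sketch of the context-encoder projection r_t = H_t W_c + b_c.
# H_t is a placeholder for the last-layer Bert representation of the
# target word; in the real model it comes from a pre-trained Bert.
rng = np.random.default_rng(0)
n, m = 768, 128                       # Bert hidden size, embedding size

H_t = rng.normal(size=n)              # stand-in for Bert output
W_c = rng.normal(size=(n, m)) * 0.02  # learned projection
b_c = np.zeros(m)                     # learned bias

r_t = H_t @ W_c + b_c                 # target-word embedding in the shared space
print(r_t.shape)
```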

Frame Encoder
In FrameNet, all frames are connected into a directed graph through the frame-to-frame relations, as shown in Figure 3. Moreover, the graph convolutional network (GCN) (Kipf and Welling, 2017) has been proven effective for modeling relationships between labels (Yan et al., 2019; Chen et al., 2019; Cheng et al., 2020; Linmei et al., 2019), and it can enrich the representation of a node by aggregating information from its neighbors. To make better use of frame knowledge and the advantages of GCNs, we propose a specialized GCN, called FrameGCN, to incorporate multiple kinds of frame knowledge into frame representations.

[Figure 3: the sub-graph of the overall FrameNet 1.7 graph corresponding to the frames Activity stop and Process stop. Nodes denote frames and directed edges denote frame-to-frame relations; the black, red and blue arrows denote the Inheritance, Using and Subframe relations respectively, and each arrow points from super-frame to sub-frame.]

Structure of FrameGCN
FrameGCN is a combination of two dedicated GCNs (FEsGCN and DefGCN) and an attention network, as shown in Figure 2. FEsGCN represents a frame by aggregating the FE features of its neighbors, while DefGCN represents a frame by aggregating the Def features of its neighbors. The attention network is responsible for combining the outputs of the two GCNs into one unified embedding; the adjacency matrix A is shared by the two dedicated GCNs. A frame-to-frame relation in FrameNet is an asymmetric relation between two frames, where one frame is called the super-frame and the other the sub-frame, as shown in Figure 3. A frame typically obtains/inherits more information from its super-frame than from its sub-frame. Therefore, we define the adjacency matrix of the graph as a weighted asymmetric matrix denoted A = (a_ij)_{|F|×|F|}. Three types of frame-to-frame relations, namely Inheritance, Using and Subframe, are used in this study.
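Since the exact edge weights are not reproduced here, the following sketch builds a toy weighted asymmetric adjacency matrix over three hypothetical frames; the specific weights (1.0 from super-frame to sub-frame, 0.5 in the reverse direction), the self-loops, and the row normalization are illustrative assumptions, not the paper's definition:

```python
import numpy as np

# Toy weighted asymmetric adjacency over a 3-frame Inheritance chain.
frames = ["Event", "Process_stop", "Activity_stop"]
idx = {f: i for i, f in enumerate(frames)}

# (super_frame, sub_frame) pairs for the Inheritance relation
edges = [("Event", "Process_stop"), ("Process_stop", "Activity_stop")]

A = np.eye(len(frames))              # self-loops
for sup, sub in edges:
    A[idx[sub], idx[sup]] = 1.0      # sub-frame draws more from its super-frame
    A[idx[sup], idx[sub]] = 0.5      # weaker flow in the reverse direction

A = A / A.sum(axis=1, keepdims=True) # row-normalize for the graph convolution
print(A.round(2))
```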

FEsGCN
The FEs of a frame express its semantic roles and structure. Frames with similar structures tend to have close semantics, so we regard FEs as features and use them to represent frames. Let FE = (e_1, e_2, ..., e_|FE|) denote the set of all frame elements in FrameNet, and V_e ∈ R^{|F|×|FE|} the feature matrix of frames represented by FEs; the ith row of V_e is the feature vector of the ith frame f_i. FEsGCN learns a mapping function that maps the node (frame) vectors represented by FEs to a new representation via the convolution operation defined by A. We implement the mapping with a two-layer GCN, which can be expressed as g_e(A, V_e) = A σ(A V_e W_e^(0)) W_e^(1), where σ is a non-linear activation, W_e^(0) ∈ R^{|FE|×h} is an input-to-hidden weight matrix for the hidden layer, and W_e^(1) ∈ R^{h×m} is a hidden-to-output weight matrix.
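A minimal sketch of the two-layer FEsGCN forward pass, assuming the standard two-layer GCN form with a ReLU activation; the adjacency matrix, the binary FE features and all weight matrices below are random placeholders:

```python
import numpy as np

def gcn_two_layer(A, V, W0, W1):
    """Two-layer graph convolution: g(A, V) = A * ReLU(A V W0) * W1 (sketch)."""
    H = np.maximum(A @ V @ W0, 0.0)   # hidden layer with ReLU
    return A @ H @ W1                 # output frame embeddings

rng = np.random.default_rng(0)
num_frames, num_fes, h, m = 4, 10, 256, 128

A = np.eye(num_frames)                # stand-in adjacency (self-loops only)
V_e = rng.integers(0, 2, size=(num_frames, num_fes)).astype(float)  # binary FE features
W0 = rng.normal(size=(num_fes, h)) * 0.02
W1 = rng.normal(size=(h, m)) * 0.02

Z = gcn_two_layer(A, V_e, W0, W1)
print(Z.shape)
```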

DefGCN
Since the frame definition is a short text that depicts an event scenario and frame elements that participate in the event, we employ Bert as a feature extractor to construct the feature matrix V d of frames. Specifically, we first input a frame definition into Bert, and subsequently take the first token's representation (corresponding to the input [CLS] token) in Bert's last layer as the feature vector of the frame. Since the name of a frame is also meaningful, we concatenate the frame name and frame definition into one string, e.g. Activity stop: an agent ceases an activity without completing it.
DefGCN likewise learns a mapping function that maps the node (frame) vectors represented by definitions to a new representation via the convolution operation defined by A. We use a network similar to FEsGCN, which can be expressed as g_d(A, V_d) = A σ(A V_d W_d^(0)) W_d^(1), where W_d^(0) ∈ R^{n×h} is an input-to-hidden weight matrix for a hidden layer with h feature maps, and W_d^(1) ∈ R^{h×m} is a hidden-to-output weight matrix.

Attentive Graph Combination
We use an attention network to dynamically combine the outputs of FEsGCN and DefGCN into one frame embedding through an attention weighting mechanism. The combination takes the following form: r_f_i = Σ_k a_{i,k} g_k(A, V_k)_i, where r_f_i ∈ R^m is the embedding of the ith frame, g_k(A, V_k)_i is the ith row of the convolved representation of graph k, and a_{i,k} is the weight of the ith frame for graph k, computed as a_{i,k} = exp(w_a · g_k(A, V_k)_i) / Σ_{k'} exp(w_a · g_{k'}(A, V_{k'})_i), where w_a ∈ R^m is a learnable vector.
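The attentive combination can be sketched as below, assuming a softmax over per-graph scores w_a · g_k(A, V_k)_i; the two GCN outputs and the attention vector are random placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
num_frames, m = 4, 128

g_fes = rng.normal(size=(num_frames, m))  # FEsGCN output (placeholder)
g_def = rng.normal(size=(num_frames, m))  # DefGCN output (placeholder)
w_a = rng.normal(size=m)                  # learnable attention vector

# Per-frame attention weights over the two graphs (assumed softmax scoring)
scores = np.stack([g_fes @ w_a, g_def @ w_a], axis=1)  # (num_frames, 2)
a = softmax(scores, axis=1)

# Weighted combination into one frame embedding per frame
r_f = a[:, :1] * g_fes + a[:, 1:] * g_def
print(r_f.shape)
```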

Scoring and Prediction
After obtaining the embeddings of target words and frames through the context encoder and frame encoder respectively, we score a target word t against each frame f_j ∈ F by computing the dot-product similarity between r_t and r_f_j:

S(r_t, r_f_j) = r_t · r_f_j, j = 1, 2, ..., |F|.

During training, all model parameters are jointly learned by minimizing a cross-entropy loss, where D is the number of training examples, |F| is the total number of frames in FrameNet, y_ij is the one-hot representation of the true frame label, and ŷ_ij is the predicted probability over frames, calculated by applying the softmax function to the scores.

During prediction, we predict the frame evoked by the target word t to be the f_j ∈ F whose representation r_f_j has the highest score with r_t (function 11).

Note that most frames contain a set of lexical units (LUs) in the form lemma.POS (e.g. stop.v). As shown in Table 1, the LUs of the frames Activity stop and Process stop are listed in the fourth row. We therefore adopt a lexicon-filtering operation to reduce the set of possible candidate frames. First, we use lemmatization and POS tools to convert the target word t into its LU form (e.g. stop.v). Second, we use this LU to match the frames whose LU sets contain it, and take the matched frames as the candidate frame set F_t for the target word t. Finally, we predict the frame label from F_t (function 12).

In light of the coverage issues of FrameNet (see Section 4.4), these two prediction functions (11 and 12) serve different circumstances. In general, we first obtain the candidate frame set F_t by lexicon filtering and then use function 12 to identify the best frame in F_t. However, if no candidate frame can be found using LUs, i.e. F_t = ∅, we identify the best frame in F using function 11.

Note that F_t typically contains only a couple of candidate frames, while F contains more than a thousand frames. This requires FI models to have very good generalization performance to handle the full F set.
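A hedged sketch of the scoring and lexicon-filtering procedure described above; the frame inventory, the LU lexicon and the embeddings are all toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
frames = ["Activity_stop", "Process_stop", "Departing"]       # toy frame set F
lu_to_frames = {"stop.v": ["Activity_stop", "Process_stop"]}  # hypothetical LU lexicon

r_t = rng.normal(size=128)                 # target-word embedding (placeholder)
r_f = rng.normal(size=(len(frames), 128))  # frame embeddings (placeholder)

scores = r_f @ r_t                         # dot-product similarity, one score per frame

lu = "stop.v"                              # LU form of the target word after lemmatization
candidates = lu_to_frames.get(lu)
if candidates:                             # lexicon filtering: restrict to F_t
    cand_idx = [frames.index(f) for f in candidates]
    best = frames[cand_idx[int(np.argmax(scores[cand_idx]))]]
else:                                      # F_t is empty: fall back to the full set F
    best = frames[int(np.argmax(scores))]
print(best)
```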

Experiments

Datasets
We employ two knowledge bases, FrameNet 1.5 and FrameNet 1.7. Both contain documents that have been annotated manually with target words and the corresponding evoked frames. The documents and annotations in FrameNet 1.7 extend those of FrameNet 1.5 and are thus more complete. The train, dev and test documents in both datasets are partitioned following Swayamdipta et al. (2017). Since a sentence in a document may contain multiple target words, we treat it as multiple target-word-sentence pairs in the train, dev and test sets. The statistics of the two datasets are given in Table 2.
To test model performance on the more challenging ambiguous data, following previous studies we construct a specialized dataset by extracting from the test data those target-word-sentence pairs in which the target word is polysemous, i.e. can evoke multiple frames.

Baselines
We first compare KGFI against five existing models. Semafor is a conditional log-linear model that uses statistical features about the target word to predict the frame label. Hermann-14 (Hermann et al., 2014) is a joint learning model that maps frame labels and the dependency path of the target word into a common embedding space. SimpleFrameId (Hartmann et al., 2017) builds a classifier on the embeddings of all words in the sentence. Open-Sesame (Swayamdipta et al., 2017) builds a classifier based on a bi-directional LSTM. Hermann-14 converts frame labels into one-hot embeddings, while the other models treat frame labels as discrete supervision signals. Peng's model (Peng et al., 2018) is a joint learning model for FI and FSRL which uses both the exemplars in the FrameNet knowledge base and the full-text annotation training data.
In addition, we implemented two Bert-based baselines for a fair comparison. One, called Bert-cls, uses Bert to represent the target word in a sentence and treats discrete frame labels as supervision signals. The other, called Bert-onehot, also uses the dual-encoder architecture (context encoder and frame encoder) and maps target words and frames into a common embedding space. The difference between KGFI and Bert-onehot is that KGFI uses GCN-based modules to incorporate frame knowledge into the frame embeddings, while Bert-onehot uses a linear network to map one-hot vectors of frame labels into frame embeddings without incorporating any knowledge. This comparison tests whether the knowledge plays a significant role in learning better frame embeddings for the FI task.

Parameter Settings
All Bert modules in KGFI were initialized with Bert-base. We set the dimensions of both the target word embedding r_t and the frame embedding r_f to 128 (m = 128), and the hidden layer size of FEsGCN and DefGCN to 256 (h = 256). The size of the Bert embedding is n = 768. The dimensions of the FE and FR feature vectors depend on the FrameNet version (see Table 2). For optimization, we use the BertAdam optimizer with a learning rate of 5e-5. Parameters are tuned on the development set with an early-stopping strategy.

Test Settings
FrameNet has a few coverage issues: (1) the LU set is incomplete for some frames; (2)

Overall Results
The overall test results, shown in Table 3, demonstrate that Bert-cls and Bert-onehot are two strong baselines, outperforming all prior work that does not incorporate pre-trained modules. Bert-onehot slightly outperforms Bert-cls in all test settings, indicating that jointly learning target word embeddings and frame embeddings is helpful for the FI task. Our best KGFI models, KGFI (2-layer) for FrameNet 1.7 and KGFI (1-layer) for FrameNet 1.5, outperform all baseline FI models in accuracy. Compared with the stronger Bert-onehot model, our model achieves absolute improvements of 1.83% and 0.67% on the two datasets respectively in the All test setting. With the help of lexicon filtering with LUs in FrameNet, the model predicts the exact frame evoked by the target word from a small set of candidate frames. The improvements are clearly credited to better prediction for ambiguous target words, since the model achieves absolute improvements of 3.75% and 1.56% in the Amb test setting on the two datasets respectively.
To the best of our knowledge, little previous work has focused on frame prediction without lexicon filtering, except for the SimpleFrameId model, so we choose SimpleFrameId and the stronger Bert-onehot model as baselines for comparing our best model under the no-lexicon-filtering setting. As shown in Table 4, compared with the stronger Bert-onehot model, our model achieves absolute improvements of 5.72% and 3.63% on the two datasets respectively in the All setting (without using LUs, choosing among more than 1,000 frames). This signifies a substantial improvement in generalization, considering that in this setting the model must predict the exact frame evoked by the target word among all frames, without knowing the possible candidate frames.
To further test our best KGFI model, we use top-K accuracy to measure performance without lexicon filtering. Higher top-K accuracy indicates that the model has learned better frame representations. Furthermore, the model can reduce the candidate frame set to a small subset (containing the K most probable frames), which is useful for downstream tasks such as LU induction for FrameNet, FSRL, etc. As shown in Table 5, compared with the Bert-onehot baseline, our best KGFI model achieves higher top-K (K = 1, 2, 3, 5) accuracy, which further demonstrates that the model has learned better frame representations by incorporating the frame knowledge.

Table 6: Ablation analysis on the FrameNet 1.7 dataset in the All-L and All-nL settings. 'w/' and 'w/o' denote that KGFI is constructed with and without the corresponding module respectively; '-L' and '-nL' denote testing with and without lexicon filtering respectively.
Since the FrameNet 1.5 dataset is relatively small, the simpler model (using a 1-layer GCN) achieves the best performance on it, while the model using a 2-layer GCN drops slightly. In general, no matter how many layers are adopted, our models consistently outperform all baselines and achieve the best performance on both datasets in all settings.

Ablation Studies
To test the contribution of each component of KGFI, we conduct ablation studies. As shown in Table 6, the results demonstrate that all three components, i.e. DefGCN, FEsGCN and the attention network, help enhance the model's performance. Even with DefGCN or FEsGCN alone, our model still outperforms the stronger baseline Bert-onehot, which indicates that the frame definitions, FEs and FRs are all useful knowledge for frame representation, and that our proposed GCN-based architecture effectively incorporates them into informative embeddings. Compared with frame definitions, FEs are more useful for frame representation, since KGFI (w/ FEsGCN) outperforms KGFI (w/ DefGCN), although it slightly lags behind the full KGFI model (w/ FrameGCN). Note that the attention module is removed when DefGCN or FEsGCN is used alone as the frame encoder.
As for the attention module, the performance of KGFI (with FrameGCN) drops when we replace it with a simple addition operation, suggesting it is necessary to use attention mechanism to integrate the outputs of DefGCN and FEsGCN.

Weighting Method for the Adjacency Matrix
To test the rationality of our proposed weighting method for the adjacency matrix A, we conduct a set of comparison experiments in which the weighted matrix is replaced with a binary matrix. A binary matrix is a widely used way to express the relations between nodes in graph modeling, whereas our weighting method expresses the hierarchical relationships between frames directly. As shown in Table 7, all models using the weighted matrix outperform their counterparts using the binary matrix, demonstrating that the weighting method has a significant impact on model performance and that our proposed weighting method for the adjacency matrix is reasonable.

Case Studies

[Figure 4: example sentences 1) The fighting has stopped for more than two years. 2) Steve passed through the Rome airport customs. 3) Ferries depart from Central to Silvermine Bay.]

Figure 4 shows that the KGFI (w/ FEsGCN) model tends to predict the correct frame by finding the semantic relatedness between FEs and the context of the target word. For instance, in sentence 1), the target word stopped may evoke Activity stop or Process stop, and the phrase the fighting is the key to distinguishing the two frames, since they differ in whether the subject of stopped is an Agent or a Process. Our KGFI (w/ FEsGCN) model has learned the semantic relation between the phrase the fighting and the FE Process, and outputs the correct frame, since the FE Agent generally relates to an entity. The Bert-onehot model cannot grasp this relation, so it outputs the wrong prediction Activity stop. On the other hand, the KGFI (w/ DefGCN) model tends to predict the frame based on the semantic similarity between the frame definition and the sentence. For instance, in sentence 2), the word Traversing in the definition (Traversing: a Theme changes location with respect to a salient location) is similar to the phrase passed through, so the model outputs the correct frame Traversing.

In sentence 3), the KGFI (w/ DefGCN) model outputs the wrong prediction Quitting a place due to the similar meaning of the word depart in the sentence and the word leaves in the frame definition (Quitting a place: a Self mover leaves an initial Source location). The KGFI (w/ FEsGCN) model, on the other hand, has learned that the word Ferries in the sentence is more closely related to the FE Theme of the frame Departing (Departing: a Theme moves away from a Source) than to the FE Self mover of the frame Quitting a place, and outputs the correct frame Departing, since a Self mover generally refers to a living object (e.g. a person or an animal). Note that the frame Departing is inherited by the frame Quitting a place, so they have nearly the same FE sets except for Theme and Self mover. In other words, our KGFI (w/ DefGCN) and KGFI (w/ FEsGCN) are complementary to each other to some extent, and KGFI (w/ FEsGCN) can capture the subtle differences between frames even when the frames have close frame-to-frame or semantic relations.
The case studies show that KGFI models can incorporate frame knowledge into their representations and, through joint learning, guide the context encoder to learn the semantic relations between frames and the context-aware representations of target words.

Related work
Researchers have made great efforts to tackle the FI problem since it was proposed in SemEval-2007 (Baker et al., 2007). It is generally regarded as a classification task. The best system in SemEval-2007 (Johansson and Nugues, 2007) adopted an SVM classifier to identify frames with a set of features such as the target lemma and target word. SEMAFOR utilized a conditional model that shares features and weights across all targets, frames, and prototypes. These approaches use manually designed features and traditional machine learning methods to learn classification models, with discrete frame names as the supervision signals.
Recently, distributed feature representations and neural network models have been used to tackle FI. In terms of model architecture, there are two lines of work. One is the joint learning approach, which converts discrete frame labels into continuous embeddings by learning the embeddings of target words and frames simultaneously. For instance, Hermann-14 (Hermann et al., 2014) implemented a model that jointly maps possible frame labels and the syntactic context of target words into the same latent space using the WSABIE algorithm, with the syntactic context initialized by concatenating word embeddings. SimpleFrameId (Hartmann et al., 2017) used SentBOW (the average of the embeddings of all words in the sentence) to represent the context, and then learned a common embedding space of contexts and frame labels following Hermann et al. (2014). The other line constructs a classifier with a deep neural network and regards discrete frame labels as supervision signals, similar to the earlier work. Open-Sesame (Swayamdipta et al., 2017) used a bi-directional LSTM to construct the FI classifier. Peng et al. (2018) proposed a joint learning model for FI and FSRL with a multitask model structure.
Different from previous studies, this paper focuses on how to represent frames by incorporating frame knowledge into frame representations and enriching frame labels with semantic information.

Conclusion
In this work, we propose a novel idea that leverages frame knowledge, including frame definitions, frame elements and frame-to-frame relations, to improve model performance on the FI task. Our proposed KGFI framework consists mainly of a Bert-based context encoder and a GCN-based frame encoder, which effectively incorporates multiple types of frame knowledge in a unified framework and jointly maps frames and target words into the same semantic space. Extensive experimental results demonstrate that all kinds of frame knowledge are useful for enriching the representations of frames, and that better frame representations help the FI task. The experimental results also show that the proposed model achieves significantly better performance than seven state-of-the-art models across two benchmark datasets.