Exploiting Definitions for Frame Identification

Frame identification is one of the key challenges for frame-semantic parsing. The goal of this task is to determine which frame best captures the meaning of a target word or phrase in a sentence. We present a new model for frame identification that uses a pre-trained transformer model to generate representations for frames and lexical units (senses) using their formal definitions in FrameNet. Our frame identification model assesses the suitability of a frame for a target word in a sentence based on the semantic coherence of their meanings. We evaluate our model on three data sets and show that it consistently achieves better performance than previous systems.


Introduction
Research on frame semantics has grown within the fields of natural language processing and cognitive science since the 1970s as the study of how we associate words and phrases with cognitive structures called frames, which characterize a small abstract scene or situation (Fillmore, 1976, 1982). The Berkeley FrameNet project (Baker et al., 1998) provides an online lexical database for frame semantics together with a corpus of annotated documents. Frame semantic parsing is the task of automatically extracting frame semantic structures from sentences. The process typically consists of three steps: target identification, which identifies frame-evoking predicates in the sentence; frame identification, which identifies the evoked frame for each target; and argument identification, which identifies arguments of a frame and labels them with semantic roles (frame elements). In this work, we focus on the frame identification problem.
FrameNet 1.7 contains more than 13,000 lexical units (a word lemma with a sense), each associated with a semantic frame. A polysemous word is associated with multiple lexical units (one for each sense), and is therefore linked to multiple frames. The frame identification task requires a system to identify the most relevant frame for a target word or phrase based on its sentence context. Here is an example: The pandemic has sparked a lot of problems for the economy.
Given the target word sparked, the goal is to determine which frame should be triggered. The word lemma spark has two senses in FrameNet: "with obj. ignite" and "provide the stimulus for". The former sense is associated with the Setting fire frame and the latter one is associated with the Cause to start frame. The Setting fire frame is defined as "this frame describes the creation of a flame...", and the Cause to start frame is defined as "a cause, animate or inanimate, causes a process, the effect, to begin". So Cause to start is the correct frame for this sentence.
Previous work has shown the success of using feature engineering with linear classification models (Johansson and Nugues, 2007) and discriminative probabilistic models (Das et al., 2010), which were later improved by applying distributed word representations and deep neural network models (Hermann et al., 2014). Syntactic information, typically dependency paths, has consistently played an important role in frame identification (Peng et al., 2018).
Our work is motivated by the rich lexicographic information about frames and lexical units provided by the FrameNet database, which has not been fully utilized for the frame identification task. Recent advances in large pre-trained transformer models (Devlin et al., 2019) have demonstrated the ability to capture semantic meaning in dictionary definitions for the related problem of word sense disambiguation (Huang et al., 2019; Blevins and Zettlemoyer, 2020).

Figure 1: Overview of the FIDO architecture. Each green block represents a different candidate (lexical unit, frame) pair for the same Target i.
Our model uses the definitions of frames and lexical units in FrameNet as a source of knowledge to help assess the semantic coherence between the target word and candidate frames. Specifically, we utilize the contextual embeddings produced by the BERT (Devlin et al., 2019) model to determine if a candidate lexical unit and frame express the same meaning as the target word in the given context. Our model achieves state-of-the-art performance on two FrameNet datasets and a FrameNet-annotated dataset based on Yahoo! Answers. Our code is open-source and available online.

Related Work
There has been considerable work on the frame identification problem with respect to FrameNet, especially since the SemEval 2007 shared task (Baker et al., 2007). Johansson and Nugues (2007) used an SVM classifier to disambiguate frames with hand-crafted features. Das et al. (2010) applied feature-based discriminative probabilistic (log-linear) models for frame identification. Hermann et al. (2014) presented a method using distributed representations of predicates and their syntactic context by mapping input representations and frame representations to a common latent space using the WSABIE algorithm (Weston et al., 2011). Hartmann et al. (2017) built a simplified model based on Hermann et al. (2014) and achieved comparable results. They also released a new FrameNet-annotated test set based on user-generated web text from Yahoo! Answers. Yang and Mitchell (2017) integrated a bidirectional LSTM neural network and a relational network to jointly decode frames.
More recently, Botschen et al. (2018) brought in multimodal representations grounded in images to improve frame identification. Peng et al. (2018) proposed a joint inference formulation that learns semantic parsers from multiple datasets.
In contrast to the previous models, our model does not rely on syntactic features. We assess semantic coherence directly from the input sentence and definitions in FrameNet.
Another line of related work is learning embeddings from dictionary definitions. It has been shown that neural networks can extract semantic information from dictionary definitions (Kumar et al., 2019; Bosc and Vincent, 2018). Recent work in word sense disambiguation (Huang et al., 2019; Blevins and Zettlemoyer, 2020) has demonstrated that providing pre-trained language models with sense definitions (glosses) can be effective. Yong and Torrent (2020) also used the sense definitions of lexical units for their research on frame induction. Our model adopts a similar architecture to Huang et al. (2019), but we focus on the frame identification task and we explore the use of both lexical unit and frame definitions for this task.

Method
Given a sentence and a target word or phrase, the frame identification task assigns the most relevant frame to the target according to the sentence context. Figure 1 shows the framework of our model called FIDO (Frame Identification with DefinitiOns). Our system takes the sentence and the definitions of associated lexical units (senses) and their frames as input to the BERT model, as indicated by the green blocks. Each green block represents the target word in the sentence, one of its senses, and that sense's associated frame in FrameNet. Then we use the output vectors to produce a probability distribution over all of the candidate frames. We select the frame with the maximum probability as the answer.

Notation
We denote the ith example (i = 1, 2, ..., n), consisting of a sentence and a designated target word or phrase, as (s_i, t_i), its correct frame as f_i^*, the set of lexical units associated with the target as l_i^1, l_i^2, ..., l_i^{m_i}, and their corresponding frames as f_i^1, f_i^2, ..., f_i^{m_i}. We seek to estimate the probability of the jth frame being the correct frame by:

p(f_i^j | s_i, t_i) = exp(g(s_i, t_i, l_i^j, f_i^j)) / \sum_{k=1}^{m_i} exp(g(s_i, t_i, l_i^k, f_i^k))    (1)

where g(·) is a scoring function produced by our model for the assignment of a frame to the sentence and target. We use the negative log likelihood as our loss function:

L = - \sum_{i=1}^{n} log p(f_i^* | s_i, t_i)    (2)

where n is the total number of training examples.
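As a concrete illustration, the probability and loss computations above can be sketched in a few lines of Python. The candidate scores below are made-up numbers standing in for the model's outputs g(·):

```python
import math

def frame_probabilities(scores):
    """Softmax over candidate-frame scores g(s, t, l_j, f_j) -- Eq (1)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def nll_loss(scores, gold_index):
    """Negative log likelihood of the gold frame -- one term of Eq (2)."""
    probs = frame_probabilities(scores)
    return -math.log(probs[gold_index])

# Toy scores for three candidate frames of one target; the gold frame is index 1.
scores = [0.2, 2.1, -0.5]
probs = frame_probabilities(scores)
loss = nll_loss(scores, 1)
```

At training time this loss is summed over all n examples; at test time the frame with the highest probability is selected.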

Modeling
FrameNet provides a unique definition for each lexical unit (LU) and frame. An LU is a pairing of a word lemma and a meaning (sense). Determining the correct LU (sense) uniquely determines the correct frame, because each sense of a polysemous word is linked to a different frame. For example, the word cut can trigger different frames depending on its meaning.

We use the BERT (Devlin et al., 2019) model as the base of our architecture to produce the function g(·) described in Eq (1). For each target, we first extract the LUs from FrameNet that have the same lemma, together with their corresponding frames, to form a set of candidate (LU, frame) pairs. Our goal is to predict whether the target in the sentence has the same meaning as the definitions of a candidate LU and its associated frame.
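The candidate generation step can be sketched as follows. The tiny LEXICON dictionary below is a hypothetical stand-in for FrameNet lookups (the real system reads LUs, frames, and their definitions from the FrameNet database); the two spark entries mirror the example from the introduction:

```python
# Hypothetical mini-lexicon: lemma -> list of (LU name, frame name, frame definition).
# The real system queries the FrameNet 1.5/1.7 database for these entries.
LEXICON = {
    "spark": [
        ("spark.v (with obj. ignite)", "Setting_fire",
         "this frame describes the creation of a flame..."),
        ("spark.v (provide the stimulus for)", "Cause_to_start",
         "a cause, animate or inanimate, causes a process, the effect, to begin"),
    ],
}

def candidate_pairs(lemma):
    """Return all candidate (LU, frame) pairs sharing the target's lemma."""
    return LEXICON.get(lemma, [])

pairs = candidate_pairs("spark")  # two candidates: Setting_fire, Cause_to_start
```

Each returned pair becomes one green block in Figure 1, i.e. one scoring pass through BERT.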
As input to the BERT model, we use the sentence as the first sequence and the concatenation of an LU definition and a frame definition as the second sequence. Each definition starts with the LU or frame name and a colon, followed by the definition text.
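A minimal sketch of this input formatting, with plain [CLS]/[SEP] markers written out explicitly (in practice the BERT tokenizer inserts these); the names and definitions below abbreviate the spark example:

```python
def build_bert_input(sentence, lu_name, lu_def, frame_name, frame_def):
    """Format one candidate as a BERT sentence pair: the input sentence is
    sequence A; the LU definition and frame definition, each prefixed by its
    name and a colon, are concatenated as sequence B."""
    seq_a = sentence
    seq_b = f"{lu_name}: {lu_def} {frame_name}: {frame_def}"
    return f"[CLS] {seq_a} [SEP] {seq_b} [SEP]"

text = build_bert_input(
    "The pandemic has sparked a lot of problems for the economy.",
    "spark.v", "provide the stimulus for",
    "Cause_to_start", "a cause, animate or inanimate, causes a process...")
```

One such string is built per candidate (LU, frame) pair, so an m-way ambiguous target yields m forward passes.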
Instead of using the output vector of the [CLS] token, as is typical, we use the last hidden vector of the target word as the output (if the target spans more than one token, we use only the first one). Passing this output vector through a linear layer yields a score for assigning a candidate frame to the sentence and target. Finally, the scores for all candidate frames are passed through the softmax function to obtain the probabilities in Eq (1).
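The scoring step can be sketched as follows, with random vectors standing in for BERT's last-layer hidden states and an untrained linear layer; the real model uses 768-dimensional BERT-base vectors and learned weights:

```python
import random

HIDDEN = 8  # toy hidden size; BERT-base uses 768

def score_candidate(hidden_states, target_token_index, weights, bias):
    """Score one (LU, frame) candidate: take the last-layer hidden vector of
    the target word's FIRST sub-token (not [CLS]) and apply a linear layer."""
    h = hidden_states[target_token_index]
    return sum(w * x for w, x in zip(weights, h)) + bias

random.seed(0)
# Toy stand-in for BERT output: 5 token positions, HIDDEN dims each.
hidden_states = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(5)]
weights = [random.gauss(0, 1) for _ in range(HIDDEN)]
score = score_candidate(hidden_states, 2, weights, 0.0)  # target at position 2
```

Running this per candidate produces the score list that the softmax in Eq (1) normalizes.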

Datasets
FrameNet: To compare FIDO with previous systems, we evaluate our model on FrameNet (FN) 1.5 using the same train/dev/test data split as prior work. We also evaluate our model on FN 1.7, which has been available since 2016 and contains nearly 20% more gold annotated data than FN 1.5. We use the same data split as Swayamdipta et al. (2017).

Table 3: Frame identification accuracy (%) on the FN 1.5 test set.

Hartmann et al. (2017)      87.6
Yang and Mitchell (2017)    88.2
Open-SESAME (2017)          86.9
Botschen et al. (2018)      88.8
Peng et al. (2018)          90.0
FIDO                        91.3

Training Details
We use the pre-trained uncased BERT-base model with the same settings as Devlin et al. (2019) and fine-tune it on our training data. We set the maximum sequence length to 300 and the batch size to 16, start the learning rate at 2e-5, and train for 5 epochs. All reported results are averaged over 3 runs with random seeds.

Table 3 shows our results on FN 1.5 compared to previous systems. Yang and Mitchell (2017) integrates a sequential and a relational network for joint learning. Peng et al. (2018) achieved the best prior results on frame identification using a multitask approach that learns semantic parsers from disjoint corpora. It is worth noting that besides the FN 1.5 training set, they also use 153,952 exemplar sentences for training, which is more than 10 times the size of our training data. FIDO achieves better performance than all of the prior systems. Table 4 shows our results compared to Peng et al. (2018) on FN 1.7. FIDO achieves a 3.0% absolute accuracy gain on this data set.

The YAGS data set (Hartmann et al., 2017) contains targets that do not have related LUs in FN 1.5 and also unlinked targets (i.e., the provided gold frame does not belong to the set of frames associated with the target in FN). By design, our model cannot make a correct prediction in these cases. There are 122 unknown or unlinked targets in the test set, on which our model receives a score of zero. Despite this limitation, our model still outperforms Hartmann et al. (2017), which demonstrates its ability to generalize across text genres.

Analysis
We performed an ablation study to assess the contributions of each part of our model. In Table 5, the first row shows the results for our complete FIDO model. Rows 2-3 show results when using only the definitions of frames (FRdef only) or LUs (LUdef only). We see that the frame definitions contribute the most to performance. Using the LU definitions alone on FN 1.7 also achieves quite good results, but combining both definitions yields better results than either one alone. To tease apart the impact of the definitions from the impact of BERT, we ran an experiment replacing each definition with just the name of the frame or LU. These results appear in the FIDO (NO def) row. Removing the definitions results in a large performance drop, so the definitions clearly play a major role.
In the bottom row, we show the results of experiments using the output vector of the [CLS] token (all other settings the same), which did not perform as well as using the target token. This is not surprising as [CLS] aggregates the entire sequence representation rather than focusing on the target.
Previous work also reported accuracy on ambiguous cases (i.e., when the target word is associated with multiple frames), which more directly shows a model's ability to disambiguate frames. However, the set of ambiguous targets differs across papers. To avoid comparing apples and oranges, we report accuracy on two different sets of ambiguous targets. In Table 6, the Amb1 column follows Peng et al. (2018), which uses the gold LU's part-of-speech (POS) tag to form the candidate frame list. In this setting, if a target has just one sense when its POS is known, it is not considered ambiguous. Our model outperforms Peng et al. (2018) on both the FN 1.5 and FN 1.7 datasets. The Amb2 column shows the accuracy on ambiguous targets using only the lemma of the target (i.e., not relying on gold POS tags). We encourage future work to articulate which setting is used.

We also analyzed whether unseen frames and unseen targets were a major source of errors for our model. On FN 1.7, our FIDO model achieved 92.1% accuracy, so it mislabeled 7.9% of the test cases. We found that 1.4% of the test cases were mislabeled and had an unseen frame (i.e., the gold frame was not seen with the target in the training data), and 0.52% were mislabeled and had an unseen target (i.e., the target was not seen in the training data). Therefore, only about a quarter of FIDO's errors were due to unseen frames and unseen targets. We conclude that even for frames and targets that appear in the training data, there is still substantial room for improvement on this task.
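The Amb2 evaluation setting can be made precise with a short sketch; the toy examples below are hypothetical, not drawn from the actual test sets:

```python
def accuracy_on_ambiguous(examples):
    """Accuracy restricted to ambiguous targets, i.e. targets whose lemma is
    linked to more than one candidate frame (the Amb2 setting, which does
    not rely on gold POS tags). Each example is a tuple of
    (num_candidate_frames, predicted_frame, gold_frame)."""
    ambiguous = [(p, g) for n, p, g in examples if n > 1]
    if not ambiguous:
        return 0.0
    correct = sum(1 for p, g in ambiguous if p == g)
    return correct / len(ambiguous)

examples = [
    (1, "Setting_fire", "Setting_fire"),      # unambiguous: excluded from Amb2
    (2, "Cause_to_start", "Cause_to_start"),  # ambiguous, correct
    (2, "Setting_fire", "Cause_to_start"),    # ambiguous, wrong
]
acc = accuracy_on_ambiguous(examples)  # 1 of 2 ambiguous targets correct
```

The Amb1 setting would additionally filter the candidate list by the gold POS tag before counting a target as ambiguous, which shrinks the evaluation set.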

Conclusion
We tackled the frame identification problem by assessing the semantic coherence between the meaning of a target word in a sentence and a candidate frame. Our model exploits the frame and lexical unit definitions provided by FrameNet and a pre-trained transformer model to generate semantic representations. The experiments show that this model achieves better performance than previous systems on two versions of FrameNet data and the YAGS dataset. Our work has demonstrated that a relatively simple architecture that brings together pre-trained language models with frame and sense definitions can produce a highly effective system for frame identification.