Semantic Frame Induction with Deep Metric Learning

Recent studies have demonstrated the usefulness of contextualized word embeddings in unsupervised semantic frame induction. However, they have also revealed that generic contextualized embeddings are not always consistent with human intuitions about semantic frames, which causes unsatisfactory performance for frame induction based on contextualized embeddings. In this paper, we address supervised semantic frame induction, which assumes the existence of frame-annotated data for a subset of predicates in a corpus and aims to build a frame induction model that leverages the annotated data. We propose a model that uses deep metric learning to fine-tune a contextualized embedding model, and we apply the fine-tuned contextualized embeddings to perform semantic frame induction. Our experiments on FrameNet show that fine-tuning with deep metric learning considerably improves the clustering evaluation scores, namely, the B-cubed F-score and Purity F-score, by about 8 points or more. We also demonstrate that our approach is effective even when the number of training instances is small.


Introduction
Semantic frames are knowledge resources that reflect human intuitions about various concepts such as situations and events. One of the most representative semantic frame resources is FrameNet (Baker et al., 1998; Ruppenhofer et al., 2016), which consists of semantic frames, lexical units (LUs) that evoke these frames, and collections of frame-annotated sentences. Semantic frame induction is the task of grouping predicates, typically verbs, according to the semantic frames they evoke. 1 For example, given the verbs in the example sentences listed in Table 1, semantic frame induction aims to group them into four clusters according to the frames that they evoke.

Table 1: Example sentences and the frames their verbs evoke.

Frame      Example sentence
FILLING    (1) She covered her mouth with her hand.
           (2) I filled a notebook with my name.
PLACING    (3) You can embed graphs in your worksheet.
           (4) He parked the car at the hotel.
REMOVING   (5) Volunteers removed grass from the marsh.
           (6) They'd drained the drop from the teapot.
TOPIC      (7) Each database will cover a specific topic.
           (8) Chapter 8 treats the educational advantages.

[Figure 1: 2D t-SNE projections of verb embeddings from (a) vanilla BERT and (b) fine-tuned BERT w/ AdaCos.]

Recent studies (Anwar et al., 2019; Ribeiro et al., 2019) have demonstrated the usefulness of contextualized word embeddings such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) in unsupervised semantic frame induction. Figure 1(a) shows a 2D t-SNE (Maaten and Hinton, 2008) projection of the vanilla BERT 2 embeddings of verbs extracted from frame-annotated sentences in FrameNet. We can confirm that the instances of the verb "cover" in Examples (1) and (7) are far apart in the space, which reflects the distance between their meanings. In contrast, "cover" in (7) and "treat" in (8), which are annotated with the same TOPIC frame, are close together. However, instances are not always properly placed in the semantic space according to the frames they evoke. For example, "remove" in (5) and "drain" in (6) are annotated with the same REMOVING frame but are not close together. This suggests that contextualized word embeddings are not always consistent with human intuition about semantic frames.
Hence, in this study, we tackle supervised semantic frame induction, which assumes the existence of annotated data for certain predicates, to induce semantic frames that adequately reflect human intuition about the frames. We propose methods that use deep metric learning to fine-tune a contextualized word embedding model so that instances of verbs that evoke the same frame are placed close together and other instances are placed farther apart in the semantic space. Figure 1(b) shows the 2D t-SNE projection of BERT embeddings after fine-tuning with AdaCos (Zhang et al., 2019), which is a representative deep metric learning method. We can confirm that predicates that evoke the same frame are close together, such as those in (3) and (4) and those in (5) and (6). This suggests that deep metric learning enables fine-tuning of BERT to obtain embedding spaces that better reflect human intuition about semantic frames.

Related Work
For automatic construction of semantic frame resources, studies on grouping predicates according to the semantic frames they evoke can be divided into two groups: those that work on semantic frame identification, in which predicates are classified into predefined frames; and those that work on semantic frame induction, in which predicates are grouped according to the frames that they evoke, which are typically not given in advance.
Semantic frame identification is often treated as a subtask of frame semantic parsing (Das et al., 2014; Swayamdipta et al., 2017), and methods using contextualized embeddings have become mainstream. For example, Jiang and Riloff (2021) used a BERT-based model to generate representations for frames and LUs by using their formal definitions. Su et al. (2021) used a BERT-based model with a context encoder, which encodes the context surrounding the frame-evoking word, and a frame encoder, which encodes the frames' definitions and semantic roles. Yong and Torrent (2020) treated semantic frame identification as a clustering task. 3 They first excluded predicates that evoke frames not included in FrameNet by applying an anomaly detection model; then, they grouped the remaining predicates according to their meanings by using contextualized embeddings of predicates and sentence embeddings of the frame definitions.
Semantic frame induction is the task of grouping predicates in texts according to the frames they evoke. Instead of frames being given in advance, each group of predicates is considered a frame. As with semantic frame identification, methods using contextualized embeddings have become mainstream. One study first performed agglomerative clustering by using the BERT embedding of a frame-evoking verb and then split each cluster into two on the basis of the verb's substitutes. Anwar et al. (2019) used the embedding of the frame-evoking verb and the average word embedding of all the words in a sentence, as obtained by skip-gram (Mikolov et al., 2013) or ELMo, and then performed agglomerative clustering. Ribeiro et al. (2019) applied graph clustering by using the ELMo embedding of the frame-evoking verb. Yamada et al. (2021a) leveraged the embedding of the masked frame-evoking verb and performed two-step clustering, which comprised intra-verb and cross-verb clustering. Yamada et al. (2021b) investigated how well contextualized word representations can recognize differences between the frames that the same verb evokes, and explored which types of representation are suitable for semantic frame induction. All of these studies focused on unsupervised semantic frame induction, with no training data. In contrast, in this study, we assume the existence of frame-annotated data for a subset of predicates appearing in a corpus, and we work on supervised semantic frame induction.

Task Description
The task of supervised semantic frame induction assumes the existence of frame-annotated data for a subset of a corpus's predicates, and it aims to build a frame induction model that leverages the annotated data. Clustering-based methods are generally used for semantic frame induction, and this is also true for supervised semantic frame induction, where the annotated data is used to learn the distance metric for clustering. In this study, the predicates that are used for training the metric and for testing do not overlap. Note that, because different predicates may evoke the same frame, instances in the test data include predicates that evoke frames that are present in the training data.

Baseline Methods
For the simplest baseline, we use a one-step clustering-based method with contextualized embeddings. The clustering method is group-average clustering based on the Euclidean distance. We also leverage the masked word embeddings and two-step clustering proposed by Yamada et al. (2021a). Regarding the former, we use a weighted average embedding (v_w+m) of the standard contextualized embedding of the frame-evoking word (v_word) and the masked word embedding (v_mask), which is a contextualized embedding of the frame-evoking word replaced by a special token "[MASK]". The embedding v_w+m is defined using a weight parameter α as follows:

v_w+m = α · v_word + (1 − α) · v_mask.    (1)

Two-step clustering first performs clustering for each frame-evoking word with the same lemma, 4 and then performs clustering over different frame-evoking words in the second step. We use X-means (Pelleg and Moore, 2000) for the first step and group-average clustering based on the Euclidean distance for the second step. All other settings here are the same as in Yamada et al. (2021a).
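As a minimal illustration, the weighted embedding is just a linear interpolation of the two vectors. The sketch below uses plain Python lists standing in for BERT embeddings; the function name is ours, not from the paper.

```python
def weighted_embedding(v_word, v_mask, alpha):
    """Weighted average v_w+m = alpha * v_word + (1 - alpha) * v_mask.

    v_word: contextualized embedding of the frame-evoking word.
    v_mask: embedding of the same position with the word replaced by [MASK].
    alpha:  interpolation weight in [0, 1], tuned on the development set.
    """
    return [alpha * w + (1.0 - alpha) * m for w, m in zip(v_word, v_mask)]
```

With alpha = 1 this reduces to the standard word embedding, and with alpha = 0 to the purely masked embedding.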

Fine-tuning by Deep Metric Learning
For supervised semantic frame induction, we fine-tune contextualized word embedding models by applying deep metric learning (Kaya and Bilge, 2019; Musgrave et al., 2020) so that the instances of predicates that evoke the same frame are closer together and those of predicates that evoke different frames are farther apart. We apply two representative deep metric learning approaches: a distance-based approach and a classification-based approach.
Distance-based Approach This is a classical deep metric learning approach, and the models typically use multiple encoders to train the distance between a pair of instances. In this approach, we use two losses, a contrastive loss and a triplet loss, to build frame induction models.
The contrastive loss (Hadsell et al., 2006) is used to train the distance between a pair of instances by using a network of two encoders with shared parameters. Specifically, the model is trained to keep instances of the same class close together and instances of different classes separated by a certain margin. The loss is defined as follows:

L_contrastive = D(x_i, x_j)                if i = j,
                max(0, m − D(x_i, x_j))    otherwise,    (2)

where x_i denotes an embedding of an instance belonging to the i-th class, m denotes a margin, and D denotes a distance function, which is generally the squared Euclidean distance. The triplet loss (Weinberger and Saul, 2009) is used for training such that, for a triplet of instances, the distance between the anchor and negative instances, which are from different classes, exceeds the distance between the anchor and positive instances, which are from the same class, by at least a certain margin. The loss is defined as follows:

L_triplet = max(0, D(x_a, x_p) − D(x_a, x_n) + m),    (3)

where x_a, x_p, and x_n denote embeddings of the anchor, positive, and negative instances, respectively, and m and D are the same as in Equation (2).
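The two distance-based losses can be sketched as follows for a single pair and a single triplet, assuming the squared Euclidean distance for D and a hinge on the margin for negative pairs. This is an illustrative reimplementation on plain Python lists, not the training code.

```python
def sq_euclidean(x, y):
    """Squared Euclidean distance, the distance function D in the losses."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def contrastive_loss(x1, x2, same_class, margin=1.0):
    """Pull same-class pairs together; push different-class pairs apart
    until their distance reaches the margin m."""
    d = sq_euclidean(x1, x2)
    return d if same_class else max(0.0, margin - d)

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Require the anchor-negative distance to exceed the anchor-positive
    distance by at least the margin m."""
    return max(0.0, sq_euclidean(anchor, positive)
               - sq_euclidean(anchor, negative) + margin)
```

In training, these losses are averaged over sampled pairs or triplets, and the gradients flow back into the BERT encoder.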
We create pairs for each instance in the training set by randomly selecting instances of predicates that evoke the same frame as positives and instances of predicates that evoke different frames as negatives. The margin that keeps negatives away is tuned on the development set.
Classification-based Approach This approach has recently become the standard for face recognition. It basically uses a network that has an encoder to obtain instance embeddings and a linear layer for multiclass classification. It is superior to the distance-based approach in that it does not require a sampling algorithm, and it saves memory because it uses only a single encoder. The loss function is based on the softmax loss:

L_softmax = −log( exp(w_i^⊤ x_i + b_i) / Σ_{j=1}^{n} exp(w_j^⊤ x_i + b_j) ),    (4)

where x_i, w_i, and b_i denote an embedding of the instance, the linear layer's weight, and a bias term, respectively, for the i-th class, and n denotes the number of classes. Many losses used in face recognition adjust the softmax loss by introducing different margins (Liu et al., 2017; Wang et al., 2018; Deng et al., 2019). These losses typically remove the bias term b_i of the softmax loss and transform the logit as w_i^⊤ x_i = ||w_i|| · ||x_i|| · cos θ_i, where θ_i is the angle between w_i and x_i. ArcFace (Deng et al., 2019) has become a popular choice because of its superior geometric interpretation. It applies l2 regularization to w_i and x_i and introduces an angular margin m and a feature scale s as hyperparameters to simultaneously enhance the intra-class compactness and inter-class discrepancy. The ArcFace loss is defined as follows:

L_ArcFace = −log( exp(s · cos(θ_i + m)) / (exp(s · cos(θ_i + m)) + Σ_{j≠i} exp(s · cos θ_j)) ).    (5)

Zhang et al. (2019) pointed out that the performance of these losses depends on the hyperparameters, and they observed the behaviors of the angular margin and the feature scale. As a result, they proposed the hyperparameter-free AdaCos loss, which removes the margin and applies the scale dynamically. The AdaCos loss is defined as follows:

L_AdaCos = −log( exp(s̃ · cos θ_i) / Σ_{j=1}^{n} exp(s̃ · cos θ_j) ),    (6)

where s̃ denotes the automatically tuned scale. While the softmax and AdaCos losses do not require a hyperparameter search, ArcFace requires hyperparameters for the margin and feature scale. Here, we explore only the margin, because Zhang et al. (2019) showed that the behaviors of the margin and the scale are similar and because the distance-based approach also explores only the margin.
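To make the logit transformation concrete, the sketch below computes ArcFace-style logits for one instance: the class weights and the embedding are l2-normalized so that their dot product is cos θ_i, an angular margin is added only for the target class, and a feature scale is applied to all logits. The function names and the single-instance interface are ours; real implementations operate on batched tensors.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def arcface_logits(x, class_weights, target, margin=0.5, scale=64.0):
    """ArcFace-style logits: s * cos(theta_i + m) for the target class,
    s * cos(theta_j) for every other class j."""
    x = l2_normalize(x)
    logits = []
    for i, w in enumerate(class_weights):
        w = l2_normalize(w)
        cos_theta = max(-1.0, min(1.0, sum(a * b for a, b in zip(w, x))))
        if i == target:
            # Add the angular margin m to the target class's angle.
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)
    return logits
```

Setting margin = 0 recovers the normalized softmax logits; AdaCos further replaces the fixed scale with one tuned automatically during training.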

Experiment
To evaluate the usefulness of fine-tuning with deep metric learning, we experimented with supervised semantic frame induction, comparing previous non-fine-tuned models against a range of fine-tuned models, from typical to more advanced ones. By varying the number of training instances, we also verified that our models were effective even when trained on a small number of instances.

Settings
Dataset The dataset in our experiments was created by extracting example sentences in which the frame-evoking word was a verb from the FrameNet 1.7 dataset. 5 These example sentences were split into three sets such that sentences with the same verb were in the same set, and the proportions of polysemous verbs were equal across the sets. We performed three-fold cross-validation with the three sets as the training, development, and test sets. Table 2 lists the dataset statistics. Note that the verbs, LUs, and instances did not overlap among the sets, but the frames did overlap. The training set was used to fine-tune the contextualized word embeddings. The development set was used to determine the criterion for the number of clusters and the weight α of the embedding v_w+m, as well as the margins for the contrastive, triplet, and ArcFace losses. The range of α was from 0 to 1 in increments of 0.1, and the candidate margins were 0.1, 0.2, 0.5, and 1.0 for the contrastive and triplet losses and 0.01, 0.02, 0.05, and 0.1 for the ArcFace loss.
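The verb-disjoint split can be sketched as follows. `split_by_verb` and the `(verb, frame)` instance format are hypothetical, and this greedy sketch balances only fold sizes, whereas the actual split also equalized the proportion of polysemous verbs.

```python
from collections import defaultdict

def split_by_verb(instances, n_sets=3):
    """Partition (verb, frame) instances into n_sets folds so that all
    instances of the same verb land in the same fold."""
    by_verb = defaultdict(list)
    for verb, frame in instances:
        by_verb[verb].append((verb, frame))
    folds = [[] for _ in range(n_sets)]
    # Assign the largest verb groups first, each to the smallest fold so far.
    for verb in sorted(by_verb, key=lambda v: -len(by_verb[v])):
        smallest = min(range(n_sets), key=lambda i: len(folds[i]))
        folds[smallest].extend(by_verb[verb])
    return folds
```

Keeping each verb within a single fold ensures that the predicates used for training the metric never overlap with those used for testing, as required by the task setting.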
Comparison Methods We used BERT 6 from Hugging Face (Wolf et al., 2020) to obtain contextualized word embeddings. We compared 12 methods in total: the vanilla model (Vanilla) and five fine-tuned models (Contrastive, Triplet, Softmax, ArcFace, AdaCos), each combined with either one-step or two-step clustering. All embeddings were processed with l2 normalization. Regarding hyperparameters, the batch size was 32, the learning rate was 1e-5, and the number of epochs for fine-tuning was five. Also, the feature scale for ArcFace was 64. The optimization algorithm was AdamW (Loshchilov and Hutter, 2017). We compared our methods with the three unsupervised methods used in Subtask-A of SemEval-2019 Task 2.

Results
Table 3 summarizes the experimental results for the 12 methods. The fine-tuned models, especially the Triplet, ArcFace, and AdaCos models, obtained higher PIF and BCF scores than the Vanilla model, except for the two-step clustering method with the Contrastive model. The reason why the Contrastive model performed worse than the other fine-tuned models could be that training the distance against a fixed margin yields a space in which the regions representing frames do not match the cluster sizes. Table 4 also lists the comparison results with previous methods. Our fine-tuning methods achieved higher PIF and BCF scores than the previous methods. These results indicate that fine-tuning with deep metric learning helps to improve the performance of semantic frame induction.
From Table 3, we can see that the two-step clustering methods tended to obtain higher overall scores than the one-step clustering methods. However, the difference in BCF scores between the one-step and two-step clustering methods with the Vanilla model was 13.9, whereas the difference in the maximum BCF scores for the two clustering methods with the fine-tuned models was only 2.3. Thus, for fine-tuned models, one-step clustering is still a good option in addition to two-step clustering. Note that one-step clustering is more straightforward to implement than two-step clustering, but it requires more computation time 8 and CPU memory to cluster many instances at once. Regarding the weight α, two-step clustering tended to incorporate v_mask more than one-step clustering did, in not only the Vanilla model but also the fine-tuned models. These results suggest that two-step clustering remains effective in masking a verb's surface information even after fine-tuning. The balance between BCP and BCR in the clustering evaluation metric depends on the final number of frame clusters, #C in Table 3. In the extreme case, BCR is 1 if #C is 1, and BCP is 1 if #C is equal to the number of instances. Hence, among models with roughly the same BCF, those with fewer clusters tend to have higher BCR. For example, as shown in Table 3, #C of Triplet in two-step clustering is 454, while that of AdaCos is 656, and we can confirm that Triplet, which has fewer clusters, obtains higher BCR than AdaCos. 8 In our experiments with a 16-core Intel Xeon Gold 6134 CPU at 3.20 GHz, the computation times were about 10 minutes for one-step clustering and about 5 minutes for two-step clustering.
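The extreme cases above follow directly from the definition of the B-cubed metrics. The following is a minimal sketch of B-cubed precision and recall (a generic reimplementation, not the evaluation script used in the experiments):

```python
def bcubed(gold, pred):
    """B-cubed precision and recall; gold[i] and pred[i] are the gold frame
    and predicted cluster of instance i."""
    n = len(gold)
    precision = recall = 0.0
    for i in range(n):
        same_pred = {j for j in range(n) if pred[j] == pred[i]}
        same_gold = {j for j in range(n) if gold[j] == gold[i]}
        correct = len(same_pred & same_gold)
        precision += correct / len(same_pred)
        recall += correct / len(same_gold)
    return precision / n, recall / n
```

BCF is the harmonic mean of the two: putting every instance in one cluster gives BCR = 1, and one cluster per instance gives BCP = 1, which is why #C matters when comparing models with similar BCF.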

Effect of Number of Training Instances
We found that fine-tuned methods outperformed previous unsupervised methods when the number of training instances was around 30,000. However, the annotation cost of building a resource like FrameNet is high, so the fewer instances needed for training, the easier it is to build other language resources and apply them to other tasks. Thus, we experimented with varying the number of training instances. Specifically, for each LU in the training set, the maximum number of instances was varied among 1, 2, 5, 10, and all instances. The resulting average numbers of training instances for the three sets were 1,273, 2,445, 5,680, 10,053, and 27,536, respectively. The numbers of verbs, LUs, and frames were the same in each setting. Table 5 lists the PIF and BCF scores for each method. Because the Vanilla model was not fine-tuned, its scores are the same in each setting. The Triplet model achieved high scores even with a small number of training instances. In the two-step clustering method with the Triplet model, the score difference between the cases of "1" and "all" is only 3.1 for PIF and 3.6 for BCF, even though the numbers of training instances are quite different, i.e., 1,273 vs. 27,536. These results show that even when a small number of examples is annotated for each meaning of a verb, this method can be expected to perform considerably better than unsupervised methods. In contrast, the Softmax, ArcFace, and AdaCos models obtained scores closer to the Triplet model in the cases of "5" or "10" but performed considerably worse with an even smaller number of training instances. We conclude that the relatively poor performance of these models with a small number of training instances was due to insufficient training of the linear layer's weights.

Analysis of Fine-tuned Embedding
It is not easy to analyze the properties of an embedding through clustering evaluation, because the performance depends on the clustering method and the number of clusters. To better understand the fine-tuned embeddings, we performed a similarity ranking evaluation and visualized the embeddings.

Similarity Ranking Evaluation
We evaluated the models by ranking instances according to their embedding similarity. Specifically, we took one verb instance as a query instance; then, we computed the cosine similarity of the embeddings between the query instance and the remaining verb instances and evaluated the similarity rankings of the instances in descending order. We used v_w+m with the same weight α that was used for the one-step clustering in Section 4. We chose recall as the metric to evaluate the instance distribution. This metric computes the average matching rate between true instances, which are instances of the same frame as the query instance, and predicted instances, which are obtained by extracting the same number of top-ranked instances as the number of true instances. For example, Set 1 of Table 2 had 153 instances of the FILLING frame out of 28,314 total instances. When one of these instances was the query instance, the number of true instances would be 152. Thus, from the total instances, we would extract the top 152 instances that were similar to the query, and if 114 instances were true instances, the score would be 114/152 = 0.75.
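The recall computation described above can be sketched as follows (`ranking_recall` is a hypothetical helper; `ranked_frames` holds the frame label of every non-query instance, sorted by descending embedding similarity to the query):

```python
def ranking_recall(query_frame, ranked_frames):
    """Fraction of true instances (same frame as the query) found among the
    top-k ranked instances, where k is the number of true instances."""
    n_true = sum(1 for f in ranked_frames if f == query_frame)
    if n_true == 0:
        return 0.0
    hits = sum(1 for f in ranked_frames[:n_true] if f == query_frame)
    return hits / n_true
```

In the FILLING example above, k = 152 and 114 hits among the top 152 give 114/152 = 0.75.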
We performed the similarity ranking evaluation in three settings with respect to the search space of the ranked instances: ALL, which included all instances; SAME, which included only instances of the same verb as the query; and DIFF, which included only instances of different verbs from the query. Table 6 lists the results. The results for ALL show that all of the fine-tuned models improved over the Vanilla model; in particular, the four fine-tuned models besides the Contrastive model performed very well, improving by more than 20 points. We thus confirmed that instances of the same frame were trained to be close to each other and instances of different frames were trained to be distant from each other. Score improvements were observed for both SAME and DIFF, and as expected, SAME scored higher than DIFF both before and after fine-tuning. However, the improvement was much larger for DIFF than for SAME, suggesting that the improvement in clustering performance by fine-tuning was mainly due to different verbs evoking the same frame being trained to be close to each other. It is important to further examine whether the improved performance might have resulted only from the frames included in the training set. That is, we need to verify that the embedding of an instance of an untrained frame could be associated with a correct frame. To investigate this, we aggregated the scores separately for cases in which the frames of the query instance were included in the training set (OVERLAP) and for cases in which they were not (NON-OVERLAP). Table 7 lists the separately aggregated results for ALL in Table 6. All of the fine-tuned models obtained higher scores than the Vanilla model, not only for frames that were in the training set but also for frames that were not. Note that the scores for NON-OVERLAP were higher overall than those for OVERLAP.
This result may be counterintuitive, but the reason is that the frames in the NON-OVERLAP case were evoked by only a few verbs, making it relatively easy to obtain a higher ranking of instances of the same frame as the query.

Embedding Visualization
To intuitively understand the embeddings given by the Vanilla model and two fine-tuned models, we visualized them with t-SNE. Figure 2 shows the two-dimensional t-SNE projection of the contextualized embeddings of the frame-evoking verbs for all instances when Set 1 in Table 2 is the test set. We used v_word, v_w+m, and v_mask for the Vanilla, AdaCos, and Triplet models, respectively. The weight α for v_w+m was 0.3, which was the best value for the one-step clustering methods with the Triplet and AdaCos models in Section 4. 9 We highlight the top ten semantic frames with the highest numbers of instances in this set.
In the Vanilla model, the instances for v_word tended to be grouped by frame but were not sufficiently grouped into clusters. For example, the instances of the SELF_MOTION frame were divided into two large groups, while those of the REMOVING frame were scattered. The instances for v_mask were somewhat more scattered than those for v_word. In addition, v_w+m tended to group instances of the same frame.
In the AdaCos and Triplet models, the instances for v_word were grouped much better for each frame than those for non-fine-tuned v_word. The results also confirmed that instances of frames with similar meanings, such as the PLACING and FILLING frames, were both identifiable and close together. However, fine-tuned v_word formed many small lumps of instances. This suggests that deep metric learning incorporates too much of a verb's surface information. On the other hand, fine-tuned v_mask was somewhat better than non-fine-tuned v_mask, but not as good as fine-tuned v_word. Because deep metric learning may rely on a verb's surface information during induction, fine-tuned v_mask may not work well. The instances for fine-tuned v_w+m were grouped better than those for fine-tuned v_word, as instances of the same frame were clustered more tightly.

Conclusion
We worked on supervised semantic frame induction: we proposed a model that uses deep metric learning to fine-tune a contextualized embedding model, and we applied the fine-tuned contextualized embeddings to perform semantic frame induction. In our experiments, we showed that fine-tuned BERT models with the triplet, ArcFace, and AdaCos losses are quite promising for semantic frame induction, as the human intuition behind semantic frame resources such as FrameNet can be well captured by deep metric learning. In particular, the fine-tuned BERT model with the triplet loss performed considerably better than vanilla BERT even when the number of training instances was small; accordingly, the fine-tuned model is expected to have a wide range of applications. We also found that one-step clustering can be a good choice in addition to two-step clustering when fine-tuning is performed.
The ultimate goal of this study is to automatically construct semantic frame knowledge from large text corpora. This goal requires not only grouping the verbs according to the frames that they evoke but also grouping their arguments according to the frame element roles that they fill. Our proposed fine-tuned contextualized word embedding with deep metric learning could be effective for clustering arguments as it is for clustering verbs. We would like to explore how to achieve this goal.

Limitations
In this study, we only conducted experiments with English FrameNet, so it is unclear how useful this method will be for other corpora and multilingual resources. However, since our method does not depend on the properties of the specific corpus and language, it is quite possible that fine-tuning would improve the scores in other datasets. In addition, as our method requires supervised data from a semantic frame knowledge resource, some annotation will be necessary when applying the method to other languages that lack such a resource.

A Clustering Results with and without Linear Combination
Tables 8 and 9 list our experimental results for semantic frame induction when using v_word, v_w+m, and v_mask in one-step and two-step clustering, respectively. The results show that v_w+m tended to perform better than v_word and v_mask, thus demonstrating the usefulness of the linear combination. This tendency was noticeable for two-step clustering but more limited for one-step clustering.
Regarding the results for v_word and v_mask, fine-tuning was effective for v_word, as the scores improved considerably, but its effectiveness was limited for v_mask. This was probably because the embedding of the special token "[MASK]," which was the source of the contextualized word embedding, was shared by all instances.

C Embedding Visualization of Remaining Models
In Figure 2, we showed two-dimensional t-SNE projections of v_word, v_w+m, and v_mask for the Vanilla, AdaCos, and Triplet models, respectively. Figure 3 shows two-dimensional t-SNE projections of v_word, v_w+m, and v_mask for the remaining models not included in Figure 2, namely, the Contrastive, Softmax, and ArcFace models, with the same settings. Figure 3 confirms that these three fine-tuned models, like the two fine-tuned models shown in Figure 2, are more semantically coherent than the Vanilla model, and that the tendencies of v_word, v_w+m, and v_mask are similar. In addition, for the Contrastive model, whose performance was relatively poor among the fine-tuned models, …

[Figure 3: 2D t-SNE projections of v_word, v_w+m, and v_mask for the Contrastive, Softmax, and ArcFace models, respectively, for all instances with Set 1 in Table 2 as the test set. The top 10 semantic frames with the highest numbers of instances in this set are highlighted.]