Joint Multi-Decoder Framework with Hierarchical Pointer Network for Frame Semantic Parsing

Current research on frame semantic parsing includes three subtasks, namely frame identification, argument identification and role classification. Most previous systems process these subtasks independently and ignore their interactions. We introduce a novel architecture based on a multi-decoder strategy to handle these subtasks together; the multi-decoder strategy strengthens their interactions. Moreover, we design a hierarchical pointer network for argument identification that reduces the computational complexity. To the best of our knowledge, this is the first work to introduce the pointer network into frame semantic parsing. Experiments show improvements over state-of-the-art models on the FrameNet dataset.


Introduction
Frame semantic parsing is a fundamental task in Natural Language Processing. It aims to parse sentences into frame-style semantic structures defined in FrameNet (Baker et al., 1998).
An example of a frame-style semantic structure is shown in Figure 1. The word write.v is a target that evokes the frame Text creation. The phrases underlined with green lines are called arguments. Author, Text and Form are the roles (also called frame elements) that the arguments play in this frame. Hence frame semantic parsing comprises three subtasks: frame identification, argument identification and role classification. For a sentence with a given target, frame identification disambiguates the frame for the target based on its contextual information, argument identification identifies the boundaries of all the arguments, and role classification assigns a semantic role to each identified argument.
Early work (Hermann et al., 2014; FitzGerald et al., 2015; Hartmann et al., 2017) on frame semantic parsing adopts a pipeline strategy. These systems apply independent models to the different subtasks, ignoring the interactions among them. Moreover, the pipeline strategy usually causes error propagation: the accuracy of frame identification can become the bottleneck of the overall performance. Later work (Yang and Mitchell, 2017; Peng et al., 2018) processes all the subtasks jointly by optimizing them together during training. These joint models show improvements over pipeline models, which demonstrates the benefit of the joint training strategy. However, their systems have no specific design to model the interactions among the subtasks.
To strengthen the interactions among the subtasks, we propose a joint framework based on three task-specific decoders. The interactions in our framework are mainly reflected in two aspects. On the one hand, the representations of both the target and its frame derived from the frame identification decoder are used to predict the arguments and roles. On the other hand, the argument identification decoder and the role classification decoder work in an alternate way, and thus they interact with each other during the entire decoding process.
The interactions bring two benefits. First, the frame information predicted by the frame identification decoder makes the predictions of arguments and roles more frame-specific. Second, the alternate decoding strategy makes each argument and role prediction depend on all previous decisions, which captures relations among different arguments and roles. For the sentence shown in Figure 1, the argument I and its role Author can contribute to the predictions of the argument my name and the role Text. Therefore, argument identification and role classification can benefit from each other by considering the arguments and roles already obtained. For argument identification, previous models (Yang and Mitchell, 2017; Peng et al., 2018) enumerate all possible spans to identify the arguments, which brings high computational complexity. To reduce it, we design a hierarchical pointer network in the argument identification decoder that predicts the boundaries of arguments directly.

Table 1:
SENTENCE: Coming to Goodwill was the first step toward my becoming totally independent.
TARGETS: come.v (Arriving), to.prep (Goal), first.a (Ordinal numbers), step.n (Intentionally act), become.v (Becoming), totally.adv (Degree)
In addition, we design a target-aware attention mechanism. It aggregates all targets in the same sentence to model the interactions among different targets, since frames evoked by different targets in the same sentence are usually closely related. Such interaction modeling can help frame identification. For example, the target to.prep in Table 1 can evoke the frame Goal or Locative relation, and the co-occurring targets (such as come.v) make to.prep more likely to evoke Goal rather than the other frame.
Overall, our main contributions can be summarized as follows:
• We design a novel multi-decoder framework to jointly process all the subtasks of frame semantic parsing. The multi-decoder strategy strengthens the interactions among frame identification, argument identification and role classification.
• We design a hierarchical pointer network that predicts the boundaries of arguments directly, within linear computational complexity. To the best of our knowledge, this is the first work to introduce the pointer network into the frame semantic parsing task.
• We design a target-aware attention mechanism to aggregate the semantic information of other targets in the same sentence.
We evaluate our model on the FrameNet dataset, and experiments show that it outperforms state-of-the-art models, demonstrating its effectiveness.

Related Work
The frame semantic parsing task was first proposed by Gildea and Jurafsky (2002) and has drawn attention since the SemEval 2007 shared task (Baker et al., 2007) was released. Early research on frame semantic parsing focuses on feature-engineered methods (Johansson and Nugues, 2007). Most of this early work regards frame semantic parsing as a pipeline of classification tasks and employs machine learning algorithms (such as Support Vector Machines).
With the popularity of neural networks and representation learning, neural network models have been introduced to model frame semantic parsing. Hermann et al. (2014) uses distributed representations for frame identification, embedding both frames and the contextual representations of words into a shared low-dimensional vector space. FitzGerald et al. (2015) uses a neural network to learn embeddings of both arguments and semantic roles, adopting fine-grained similarity between roles to overcome the sparsity of some labeled data. Besides, a system based on pre-trained distributed word representations (Hartmann et al., 2017) was developed to improve domain adaptation. These models still work in a pipeline fashion.
However, pipeline models usually ignore the interactions among subtasks and suffer from error propagation. Joint models have been proposed to solve these problems. Yang and Mitchell (2017) proposes an ensemble strategy that integrates two different models into one ensemble model. Peng et al. (2018) proposes a multi-task framework that jointly handles two different semantic parsing tasks from disjoint data. Both models process all the subtasks jointly by optimizing them together during training, and their experiments show improvements over previous pipeline models, which proves the benefit of the joint training strategy. However, their models have no specific design for the interactions.
The common neural network architectures for frame semantic parsing can be divided into sequence labeling models and relational models. Yang and Mitchell (2017) proposes a sequence labeling model based on the BIO tagging scheme, containing multiple LSTM layers and a Conditional Random Field (CRF) layer. Swayamdipta et al. (2017) adopts a segmental RNN and a relational model to capture span-level dependencies between the predicate and its arguments; the relational model enumerates all possible spans to compute matching scores. Both types of models require O(n^2) computational complexity. To reduce it, we design a hierarchical pointer network that identifies arguments with linear computational complexity.

Method
As shown in Figure 2, our framework consists of four modules: (1) the encoder module, (2) the frame identification decoder module, (3) the argument identification decoder module, and (4) the role classification decoder module.
Specifically, given a target t and a sentence S = (w_0, …, w_{n−1}), the encoder module computes the contextual representations h_0, …, h_{n−1}; then the decoder modules handle the three subtasks jointly. The frame identification decoder builds a target representation for t and identifies the frame f ∈ F evoked by t. Suppose there are k argument spans a_0, …, a_{k−1} of f in S. For each argument a_τ = (w_{i_τ^s}, …, w_{i_τ^e}), the argument identification decoder identifies the boundaries i_τ^s and i_τ^e, and the role classification decoder assigns a semantic role r_τ ∈ R_f to a_τ. Here F and R_f are the sets of all frames and of all semantic roles of f defined in FrameNet.
The three decoders interact with each other as follows:
• The frame identification decoder builds a target representation for t to identify the frame f. Both the target representation and the embedding of f are taken as inputs by the other decoders for argument identification and role classification.
• The role classification decoder assigns a role r_τ to the current argument span a_τ, and the embedding of r_τ is then taken as an input for identifying the boundary of the next argument a_{τ+1}. In other words, these two decoders work in an alternate way, so identifying a_{τ+1} and r_{τ+1} considers all the historical information of a_0, …, a_τ and their roles r_0, …, r_τ.
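
The alternating procedure described above can be sketched in a few lines of Python. This is a minimal sketch: predict_span and predict_role are hypothetical stand-ins for the two decoders, and only the control flow and the feedback of the previous role embedding are illustrated.

```python
import numpy as np

def alternate_decode(h, frame_emb, predict_span, predict_role, max_steps=30):
    """Alternate between argument identification and role classification.

    h:         (n, d) contextual representations of the sentence
    frame_emb: embedding of the identified frame f
    Returns a list of (start, end, role) triples.
    """
    results = []
    prev_role_emb = np.zeros_like(frame_emb)  # no previous role at step 0
    for _ in range(max_steps):
        # Argument decoder consumes the frame info and the previous role.
        start, end = predict_span(h, frame_emb, prev_role_emb, results)
        # Role decoder labels the span just found.
        role, role_emb = predict_role(h, frame_emb, start, end)
        if role == "None":        # special stop symbol (see Role Classification Module)
            break
        results.append((start, end, role))
        prev_role_emb = role_emb  # feeds the next argument identification step
    return results
```

The loop makes explicit why each new span and role prediction can condition on all previous decisions.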

Encoder Module
The encoder module converts the sentence S = (w_0, …, w_{n−1}) into a sequence of vectors h_0, …, h_{n−1}, where h_i is the contextual representation of word w_i. For each token we concatenate its word embedding e_i^w, lemma embedding e_i^l, POS embedding e_i^p and a binary tag embedding e_i^b to obtain e_i = [e_i^w; e_i^l; e_i^p; e_i^b]. The binary tag embedding e_i^b distinguishes t from the other words in S: letting i_t be the position index of t in S, e_i^b is the embedding of the indicator of i = i_t. Finally, e_i is fed to the encoder to obtain the contextual representation h_i. The word and lemma embeddings are initialized with GloVe (Pennington et al., 2014), while the POS embedding is randomly initialized. We use a Bi-LSTM as the encoder in our experiments, which can also be replaced with any other encoder such as BERT (Devlin et al., 2018). The dimension of e_i is d_e and the dimension of h_i is d_h.
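
The per-token feature construction can be sketched as follows. This is a minimal numpy sketch, assuming a hypothetical two-row tag_table holding the embeddings for the binary indicator (row 0 for non-target tokens, row 1 for the target).

```python
import numpy as np

def token_features(word_emb, lemma_emb, pos_emb, i, i_t, tag_table):
    """Concatenate word, lemma, POS and binary tag embeddings for token i.

    i_t is the position index of the target t in the sentence;
    tag_table is a (2, d_b) embedding table for the binary indicator.
    """
    e_b = tag_table[1] if i == i_t else tag_table[0]
    return np.concatenate([word_emb, lemma_emb, pos_emb, e_b])
```

The resulting vectors e_0, …, e_{n−1} would then be fed to the Bi-LSTM (or BERT) encoder to produce h_0, …, h_{n−1}.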

Frame Identification Module
In the frame identification module, we build a target representation r_t for t and identify the frame f based on r_t. A sentence S is likely to contain multiple targets t_0, …, t_{m−1} evoking multiple frames f_0, …, f_{m−1}, and we believe that the other targets in S can contribute to identifying the current frame f for target t. We therefore design a target-aware attention mechanism that aggregates the contextual representations of all targets T_S = {t_0, …, t_{m−1}} in S (including t itself) into a context vector c_t. We concatenate c_t with the target's contextual representation h_{i_t} and its embedding e_{i_t} to obtain the target representation r_t = [c_t; h_{i_t}; e_{i_t}]. With the target representation, we generate a probability distribution over frames and identify f by an argmax operation.
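
The aggregation over co-occurring targets can be sketched as follows. This is a simplified sketch: the bilinear score h_{i_t}^T W h_j is an assumption for illustration, since the exact scoring function is not reproduced here; what matters is that attention runs only over target positions.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def target_aware_attention(h, target_idxs, t_idx, W):
    """Aggregate contextual representations of all targets in the sentence.

    h:           (n, d) contextual representations
    target_idxs: positions of all targets T_S in the sentence
    t_idx:       position of the current target t
    W:           (d, d) hypothetical bilinear weight matrix
    Returns the context vector c_t and the attention weights.
    """
    q = h[t_idx]
    keys = h[target_idxs]        # (m, d): only target positions attend
    scores = keys @ (W @ q)      # (m,) bilinear compatibility scores
    alpha = softmax(scores)
    c_t = alpha @ keys           # weighted sum over target representations
    return c_t, alpha
```

c_t would then be concatenated with h_{i_t} and e_{i_t} to form r_t.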

Argument Identification Module
The argument identification decoder identifies the boundaries of the argument spans a_0, …, a_{k−1} sequentially. For the τ-th argument a_τ = (w_{i_τ^s}, …, w_{i_τ^e}), the historical information of a_0, …, a_{τ−1} and their roles r_0, …, r_{τ−1} should be utilized. Hence we use an LSTM, denoted LSTM_A, to record this history; similarly, another LSTM named LSTM_R is applied in the role classification decoder. The argument identification decoder interacts with the other decoders by taking their outputs as inputs: at the τ-th step, LSTM_A consumes the frame embedding e_f, the embedding e_{r_{τ−1}} of the previous role r_{τ−1}, and the hidden state of LSTM_R. Here h_τ^arg and h_τ^role denote the hidden states at timestep τ of LSTM_A and LSTM_R, respectively.
As we want to identify the start and end positions of a_τ, namely i_τ^s and i_τ^e, we build two boundary feature representations, h_τ^STA and h_τ^END, from h_τ^arg and e_f. The dimensions of both h_τ^STA and h_τ^END are the same as those of the contextual representations h_0, …, h_{n−1}, and the MLP in our experiments consists of two linear layers with a ReLU activation in between.

Hierarchical pointer network
With the two representations h_τ^STA and h_τ^END, we apply a hierarchical pointer network to identify i_τ^s and i_τ^e. The hierarchical pointer network contains two pointer networks, as shown in Figure 3. It first identifies i_τ^s and then identifies i_τ^e conditioned on i_τ^s; if the two were identified simultaneously, i_τ^s and i_τ^e could sometimes be inconsistent. Besides, to avoid duplicate predictions, all spans in a_0, …, a_{τ−1} are masked when identifying i_τ^s and i_τ^e. In the prediction, H is the d_h × n matrix (h_0, …, h_{n−1}) of encoder outputs for sentence S, and W_4, W_5 and W_6 are d_h × d_h weight matrices.
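
The masked two-step selection can be sketched as follows. This is a simplified sketch: the scores are plain dot products between the boundary queries and the encoder outputs, whereas the actual model uses the learned weight matrices W_4, W_5 and W_6; the points illustrated are the masking of previously predicted spans and the constraint that the end position cannot precede the start.

```python
import numpy as np

def pointer_step(H, h_sta, h_end, mask):
    """Pick the start and then the end of the next argument span.

    H:      (n, d) encoder outputs, stored row-wise here for convenience
    h_sta:  (d,) start boundary query (h_tau^STA in the paper)
    h_end:  (d,) end boundary query   (h_tau^END in the paper)
    mask:   (n,) boolean, True at positions inside already-predicted spans
    """
    s_scores = H @ h_sta
    s_scores[mask] = -np.inf      # forbid positions in previous spans
    i_s = int(np.argmax(s_scores))
    e_scores = H @ h_end
    e_scores[mask] = -np.inf
    e_scores[:i_s] = -np.inf      # end must not precede start
    i_e = int(np.argmax(e_scores))
    return i_s, i_e
```

Each call scans the sentence once per boundary, which is where the linear complexity per argument comes from.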
The hierarchical pointer network achieves argument identification with linear computational complexity. For a sentence with n tokens and k arguments, our model identifies the start and end positions of each argument with O(2n) computational complexity, and O(2nk) for all arguments.

Role Classification Module
The role classification module assigns semantic roles r_0, …, r_{k−1} to the arguments a_0, …, a_{k−1}. Assigning r_τ to a_τ also requires the semantic information of a_0, …, a_{τ−1} and their roles r_0, …, r_{τ−1}, so we use the same LSTM architecture, named LSTM_R, to record them. Both the contextual information of a_τ and the frame embedding e_f are used to predict r_τ: h_{i^e} + h_{i^s} represents the boundary feature of a_τ, and h_{i^e} − h_{i^s} represents the inner feature of the span (Wang and Chang, 2016; Cross and Huang, 2016; Ouchi et al., 2018). With the probability distribution P(r_τ | S, t, f, a_τ), we predict the role r_τ.
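
The two span features can be computed directly from the encoder outputs. A minimal sketch (numpy, with H stored row-wise):

```python
import numpy as np

def span_features(H, i_s, i_e):
    """Boundary and inner features of the span (i_s, i_e).

    H: (n, d) contextual representations h_0, ..., h_{n-1}.
    Returns the sum feature h_{i_e} + h_{i_s} and the
    difference feature h_{i_e} - h_{i_s}, as described above.
    """
    return H[i_e] + H[i_s], H[i_e] - H[i_s]
```

These two vectors, together with h_τ^role and e_f, would feed the role classifier.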
Moreover, we add a special role 'None' as the final decoding step, letting r_k be 'None'. During inference, the role classification decoder and the argument identification decoder automatically stop when 'None' is predicted.

Loss Function
We utilize cross-entropy loss to maximize the probabilities of the oracle frame type, span boundaries (start-end pairs) and role types. The role loss sums log P(r_τ | S, t, f, a_τ) over all arguments plus log P(r_None | S, t, f, a_k) for the stop symbol. We optimize the losses of the three subtasks jointly, where α, β and γ are hyper-parameters that balance the three subtask losses during optimization.
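
The joint objective can be sketched as a weighted sum of the three cross-entropy losses. This is a minimal numpy sketch with hypothetical inputs (probability vectors and gold indices per subtask), not the training code itself.

```python
import numpy as np

def cross_entropy(p, gold):
    # negative log-probability of the gold label (epsilon for stability)
    return -np.log(p[gold] + 1e-12)

def joint_loss(p_frame, gold_f, p_starts, gold_s, p_ends, gold_e,
               p_roles, gold_r, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the frame, argument and role losses.

    alpha, beta, gamma correspond to the balancing hyper-parameters.
    """
    l_frame = cross_entropy(p_frame, gold_f)
    l_arg = sum(cross_entropy(p, g) for p, g in zip(p_starts, gold_s)) \
          + sum(cross_entropy(p, g) for p, g in zip(p_ends, gold_e))
    l_role = sum(cross_entropy(p, g) for p, g in zip(p_roles, gold_r))
    return alpha * l_frame + beta * l_arg + gamma * l_role
```

In training, the three terms would be accumulated over a batch and minimized together.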
We also follow the same train/development/test split. Meanwhile, previous work adds the partially-annotated exemplar sentences (each exemplar sentence contains only one target). As reported in previous work (Yang and Mitchell, 2017; Swayamdipta et al., 2017), the exemplar sentence data helps to improve model performance. We add it as pre-training data for our model.

Pre-process. Previous work removes argument spans longer than 20, a constraint that helps to reduce the computational complexity from O(n^2) to O(n). Although our model does not need such a constraint thanks to its better computational complexity, we keep the same setting as previous work for comparison. We also report results for training our model without the length constraint.

Setup. We train our models in two steps following previous work. In the first step, we pre-train our model with the partially-annotated exemplar sentence data. Then we train the model on the official train set. We evaluate our model on the development set and save the best-performing model for testing. We use GloVe (Pennington et al., 2014) to initialize the word embeddings, and average the existing embeddings for out-of-vocabulary words. We randomly initialize the embeddings for part-of-speech tags and token type tags. All the embeddings are learnable during training.
Other detailed hyper-parameters are shown in Table 2.

Model. We compare our model with the following previous models:
SEMAFOR: A widely known system that uses a variety of syntactic features.
Framat+context: An extended version of Framat that adds extra context features (Roth and Lapata, 2015).
Hermann et al. (2014): A frame identification model that uses feature representations based on word embeddings and the WSABIE algorithm (Weston et al., 2011). Open-SESAME: A pipeline model that predicts frames following FitzGerald et al. (2015) and designs a softmax-margin segmental RNN to improve argument identification.

Experiment Metrics And Result
We evaluate our model on the metrics of frame identification accuracy and full structure extraction. Note that previous systems may also report ensemble models based on different ensemble methods to improve performance, and the multi-task model of Peng et al. (2018) uses extra training data. For comparability, we only report the performance of single models trained on FrameNet data only. We note that none of the above-mentioned previous models is based on BERT, which is widely applied in many NLP tasks. To explore the impact of BERT on frame semantic parsing, we implement a BERT-based version of our model and also report its results.

Frame Identification. The frame identification accuracy metrics include Ambiguous and All, as shown in Table 3. Ambiguous evaluates targets that evoke more than one possible frame in FrameNet, and All evaluates all targets. According to Peng et al. (2018), some previous studies' ambiguous lexical unit sets differ from the one in the official frame directory, which makes their results not comparable; therefore it is fairer to use All to evaluate frame identification performance. Our model outperforms all previous models (0.2 points over the previous state of the art). We believe this is because the decoders of our model fully interact and make each decision by considering the decisions of all previous steps; such a strategy is likely to predict more complete arguments and roles.

Ablation study
We train our model in different settings and evaluate them on the full structure extraction metrics to measure overall performance. We consider the influences of the decoders' interactions, the pre-train data (partially-annotated exemplar sentences), the target-aware attention mechanism (TAM) and the length constraint on arguments.
As mentioned before, all the decoders of our model interact with each other, and the interactions are reflected in two aspects. To verify their effectiveness, we remove the interactive parts of the decoders respectively. Table 5 shows the results of models with the following settings: Without interaction (Arg&Role) means the interaction between the argument identification decoder and the role classification decoder is removed.
Without interaction (Frame) means the identified frame information in frame identification decoder is not accessible to the other decoders.
Without interaction (Both) represents that both of the interactions above are removed.
As shown in Table 5, performance drops to varying degrees when any interactive part is removed. The effect of interaction (Frame) is more pronounced than that of interaction (Arg&Role) (73.6 vs. 73.9). Without both kinds of interactive parts, performance drops from 76.0 to 75.3. We also notice that the recall of our model decreases as interactions are removed. This verifies the previous analysis that the interactions among decoders let our model predict with a global view and thus predict more complete arguments and roles. Table 6 shows the influences of the pre-train data and the target-aware attention mechanism. The model without the pre-training step drops by 3.1 points of F1, demonstrating that the pre-train data is beneficial to model performance. This result is consistent with the conclusion of Yang and Mitchell (2017) that models gain 3-4 points of F1 when adding the partially-annotated exemplar sentences. We also consider the influence of the target-aware attention mechanism. TAM does not apply during the pre-training step because the partially-annotated exemplar sentences contain only one target per sentence; to eliminate the interference of this gap between the train data and the pre-train data, we compare under the same setting of skipping the pre-training step. As shown in Table 6, the model without TAM drops from 72.9 to 72.5 on F1, which shows that the target-aware attention mechanism contributes to the overall performance of frame semantic parsing.
As mentioned before, our model has an advantage in terms of computational complexity, so we remove the length constraint on spans adopted in previous work. Table 7 shows the result: our model improves slightly without the length constraint. This shows that our model benefits from the complete data, and it suggests that the hierarchical pointer network is good at capturing long-distance dependencies. We encourage future work to train and evaluate on the complete data when computational complexity allows.

Error Analysis
We follow the error analysis method of Peng et al. (2018) and compare our model with theirs. Table 8 shows the proportions of five error types. Although missing arguments is the major error type for both our model and that of Peng et al. (2018), our model reduces it by 10.1%, which shows that our model tends to predict more complete arguments and is more likely to cover the gold arguments. Correspondingly, this brings increases in extra predicted arguments, role errors and span errors. We attribute this to the fact that our model captures the relations between different roles and makes each decision by considering all previous decisions; it thus prefers to predict roles related to previously predicted roles. Such a strategy is more likely to cover the gold arguments, but it may also predict more arguments and roles than the ground truth.

Conclusion
We design a multi-decoder framework to process all the subtasks of frame semantic parsing jointly. The multi-decoder framework strengthens the interactions among the three subtasks. Our model works in an alternate way, predicting each argument and role by considering all previous decisions. We apply a hierarchical pointer network that achieves argument identification with linear computational complexity. Experiments show improvements over state-of-the-art models.