A Span-Based Model for Joint Overlapped and Discontinuous Named Entity Recognition

Research on overlapped and discontinuous named entity recognition (NER) has received increasing attention. The majority of previous work focuses on either overlapped or discontinuous entities. In this paper, we propose a novel span-based model that can recognize both overlapped and discontinuous entities jointly. The model includes two major steps. First, entity fragments are recognized by traversing over all possible text spans; thus, overlapped entities can be recognized. Second, we perform relation classification to judge whether a given pair of entity fragments is overlapping or in succession. In this way, we can not only recognize discontinuous entities but also doubly check the overlapped entities. As a whole, our model can essentially be regarded as a relation extraction paradigm. Experimental results on multiple benchmark datasets (i.e., CLEF, GENIA and ACE05) show that our model is highly competitive for overlapped and discontinuous NER.


Introduction
Named entity recognition (NER) (Sang and De Meulder, 2003) is a fundamental task in natural language processing (NLP), due to its wide application in information extraction and data mining (Lin et al., 2019b; Cao et al., 2019). Traditionally, NER is presented as a sequence labeling problem and widely solved by conditional random field (CRF) based models (Lafferty et al., 2001). However, this framework has difficulty handling overlapped and discontinuous entities (Lu and Roth, 2015; Muis and Lu, 2016), which we illustrate using the two examples shown in Figure 1. In the first example, the two entities "Pennsylvania" and "Pennsylvania radio station" are nested with each other, and the second example shows a discontinuous entity "mitral leaflets thickened" involving three fragments.
There have been several studies investigating overlapped or discontinuous entities (Finkel and Manning, 2009; Lu and Roth, 2015; Muis and Lu, 2017; Katiyar and Cardie, 2018; Ju et al., 2018; Fisher and Vlachos, 2019; Luan et al., 2019; Wang and Lu, 2019). The majority of them focus on overlapped NER, with only a few exceptions to the best of our knowledge. Muis and Lu (2016) present a hypergraph model that is capable of handling both overlapped and discontinuous entities. Wang and Lu (2019) extend the hypergraph model with long short-term memories (LSTMs) (Hochreiter and Schmidhuber, 1997). Dai et al. (2020) propose a transition-based neural model for discontinuous NER. With these models, NER can be conducted universally without any assumption excluding overlapped or discontinuous entities, which is more practical in real applications.
The hypergraph (Muis and Lu, 2016; Wang and Lu, 2019) and transition-based (Dai et al., 2020) models are flexible and can be adapted to different tasks, achieving great success for overlapped or discontinuous NER. However, these models need manually defined graph nodes, edges and transition actions. Moreover, they build graphs or generate transitions gradually along the words of the sentence, which may lead to error propagation (Zhang et al., 2016). In contrast, the span-based scheme might be a good alternative, as it is much simpler, involving only span-level classification. Thus, it needs less manual intervention, and span-level classification can be fully parallelized without error propagation. Recently, Luan et al. (2019) utilized the span-based model effectively for information extraction.
In this work, we propose a novel span-based joint model to recognize overlapped and discontinuous entities simultaneously in an end-to-end way. The model utilizes BERT (Devlin et al., 2019) to produce deep contextualized word representations, and then enumerates all candidate text spans (Luan et al., 2019), classifying whether they are entity fragments. Following that, fragment relations are predicted by another classifier to determine whether two specific fragments involve a certain relation. We define two relations for our goal: Overlapping and Succession, which are used for overlapped and discontinuous entities, respectively. In essence, the joint model can be regarded as one kind of relation extraction model, adapted for our goal. To enhance our model, we also utilize syntax information by using a dependency-guided graph convolutional network (Kipf and Welling, 2017; Jie and Lu, 2019; Guo et al., 2019).

Figure 1: Examples to illustrate the differences between the sequence labeling model and our span-based model. On the left, word fragments marked with the same number belong to the same entity. On the right, blue rectangles denote the recognized entity fragments, and solid lines indicate the Succession or Overlapping relations between them (the two relations are mutually exclusive).
We evaluate our proposed model on several benchmark datasets that include both overlapped and discontinuous entities (e.g., CLEF (Suominen et al., 2013)). The results show that our model outperforms the hypergraph (Muis and Lu, 2016; Wang and Lu, 2019) and transition-based (Dai et al., 2020) models. Besides, we conduct experiments on two benchmark datasets including only overlapped entities (i.e., GENIA (Kim et al., 2003) and ACE05). Experimental results show that our model also obtains performance comparable with the state-of-the-art models (Luan et al., 2019; Wadden et al., 2019; Straková et al., 2019). In addition, we observe that our approaches for model enhancement are effective on the benchmark datasets. Our code is available at https://github.com/foxlf823/sodner.

Related Work
In the NLP domain, NER is usually considered as a sequence labeling problem (Liu et al., 2018; Lin et al., 2019b; Cao et al., 2019). With well-designed features, CRF-based models have achieved the leading performance (Lafferty et al., 2001; Finkel et al., 2005; Liu et al., 2011). Recently, neural network models have been exploited for feature representation (Chen and Manning, 2014; Zhou et al., 2015). Moreover, contextualized word representations such as ELMo (Peters et al., 2018), Flair (Akbik et al., 2018) and BERT (Devlin et al., 2019) have also achieved great success. For NER, the end-to-end bi-directional LSTM-CRF models (Lample et al., 2016; Ma and Hovy, 2016; Yang et al., 2018) are a representative architecture. However, these models are only capable of recognizing regular named entities.
For overlapped NER, the earliest model to our knowledge is proposed by Finkel and Manning (2009), who convert overlapped NER into a parsing task. Lu and Roth (2015) propose a hypergraph model to recognize overlapped entities, which leads to a number of extensions (Muis and Lu, 2017; Katiyar and Cardie, 2018). Moreover, recurrent neural networks (RNNs) are also used for overlapped NER (Ju et al., 2018). Other approaches include multi-grained detection (Xia et al., 2019), boundary detection (Zheng et al., 2019), anchor-region networks (Lin et al., 2019a) and machine reading comprehension (Li et al., 2020). The state-of-the-art models for overlapped NER include the sequence-to-sequence (seq2seq) model (Straková et al., 2019), where the decoder predicts multiple labels for a word and moves to the next word until it outputs the "end of word" label, and the span-based model (Luan et al., 2019; Wadden et al., 2019), where overlapped entities are recognized by classification over enumerated spans.

Compared with the large body of related work on overlapped NER, there are no studies targeting only discontinuous NER, but several address both overlapped and discontinuous NER. Early studies addressed the problem by extending the BIO label scheme (Tang et al., 2013; Metke-Jimenez and Karimi, 2016). Muis and Lu (2016) first proposed a hypergraph-based model for recognizing overlapped and discontinuous entities, and then Wang and Lu (2019) utilized deep neural networks to enhance the model. Very recently, Dai et al. (2020) proposed a transition-based neural model with manually-designed actions for both overlapped and discontinuous NER. In this work, we also aim to design a competitive model for both overlapped and discontinuous NER. The differences are that our model is span-based (Luan et al., 2019) and it is also enhanced by a dependency-guided graph convolutional network (GCN) (Kipf and Welling, 2017; Guo et al., 2019).
To our knowledge, syntax information is commonly neglected in most previous work on overlapped or discontinuous NER, except Finkel and Manning (2009), who employ a constituency parser to transform a sentence into a nested entity tree, so that syntax information naturally facilitates NER. By contrast, syntax information has been utilized in some studies for traditional regular NER. Under the traditional statistical setting, syntax information is used through manually-crafted features (Hacioglu et al., 2005; Ling and Weld, 2012) or auxiliary tasks (Florian et al., 2006).

Method
The key idea of our model includes two mechanisms. First, our model enumerates all possible text spans in a sentence and then exploits a multi-class classification strategy to determine whether each span is an entity fragment as well as its entity type. Based on this mechanism, overlapped entities can be recognized. Second, our model performs pairwise relation classification over all entity fragments to recognize their relationships. We define three kinds of relation types:

• Succession, indicating that the two entity fragments belong to one single named entity.
• Overlapping, indicating that the two entity fragments have overlapped parts.
• Other, indicating that the two entity fragments have other relations or no relations.
With the Succession relation, we can recognize discontinuous entities.
Through the Overlapping relation, we aim to improve the recognition of overlapped entities with double supervision. The proposed model is essentially a relation extraction model adapted for our task. The architecture of our model is illustrated in Figure 2; the main components include (1) word representation, (2) graph convolutional network, (3) span representation, and (4) joint decoding, which are introduced in the following subsections, respectively.
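To make the training signal concrete, the gold relations between entity fragments can be derived from annotated entities roughly as follows. This is a minimal sketch under our own assumptions; the exact gold-relation construction is not spelled out in the text, and the helper function is hypothetical:

```python
def fragment_relations(entities):
    """Derive gold relations between entity fragments (a sketch).

    Each entity is a list of (start, end) fragments. Fragments of the
    same entity are labeled Succession; fragments of different entities
    that share tokens are labeled Overlapping; all other pairs are Other.
    """
    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    all_frags = sorted({f for ent in entities for f in ent})
    relations = {}
    for i, a in enumerate(all_frags):
        for b in all_frags[i + 1:]:
            if any(a in ent and b in ent for ent in entities):
                relations[(a, b)] = "Succession"
            elif overlaps(a, b):
                relations[(a, b)] = "Overlapping"
            else:
                relations[(a, b)] = "Other"
    return relations

# "Pennsylvania" nested in "Pennsylvania radio station", plus a
# discontinuous entity with fragments (5, 5) and (7, 7).
ents = [[(0, 0)], [(0, 2)], [(5, 5), (7, 7)]]
rels = fragment_relations(ents)
print(rels[((0, 0), (0, 2))])  # Overlapping
print(rels[((5, 5), (7, 7))])  # Succession
```

Checking entity membership before token overlap keeps the two relations mutually exclusive, matching the caption of Figure 1.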

Word Representation
We exploit BERT (Devlin et al., 2019) as inputs for our model, which has been demonstrated effective for a range of NLP tasks. Given an input sentence x = {x_1, x_2, ..., x_N}, we convert each word x_i into word pieces and then feed them into a pretrained BERT module. After the BERT calculation, each sentential word may involve vectorial representations of several pieces. Here we employ the representation of the beginning word piece as the final word representation, following Wadden et al. (2019). For instance, if "fevers" is split into "fever" and "##s", the representation of "fever" is used as the whole word representation. Therefore, all the words in the sentence x correspond to a matrix H = {h_1, h_2, ..., h_N} ∈ R^{N×d_h}, where d_h denotes the dimension of h_i.
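The first-word-piece strategy can be illustrated as follows. This is a toy sketch, not the paper's code: the function name and the 3-dimensional "BERT" vectors are our own, and we assume the standard WordPiece convention of marking continuation pieces with a leading "##":

```python
import numpy as np

def first_piece_rows(word_pieces, piece_vectors):
    """Keep only the vector of the first word piece of each word.

    word_pieces: list of word-piece strings; continuation pieces start
    with "##" (WordPiece convention).
    piece_vectors: (num_pieces, d_h) array of encoder outputs.
    Returns a (num_words, d_h) matrix H, one row per original word.
    """
    keep = [i for i, p in enumerate(word_pieces) if not p.startswith("##")]
    return piece_vectors[keep]

pieces = ["the", "patient", "has", "fever", "##s"]
vecs = np.arange(5 * 3).reshape(5, 3).astype(float)  # toy encoder outputs
H = first_piece_rows(pieces, vecs)
print(H.shape)  # (4, 3): "fevers" is represented by the "fever" row
```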

Graph Convolutional Network
Dependency syntax information has been demonstrated to be useful for NER previously (Jie and Lu, 2019). In this work, we also exploit it to enhance our proposed model. The graph convolutional network (GCN) (Kipf and Welling, 2017) is one representative method to encode dependency-based graphs and has been shown effective in information extraction. Thus, we choose it as one standard strategy to enhance our word representations. Concretely, we compute

H^{(l+1)} = ReLU(A H^{(l)} W^{(l)} + b^{(l)}),   (1)

where W^{(l)} and b^{(l)} are the weight and bias of the l-th layer, and H^{(0)} = H. A ∈ R^{N×N} is an adjacency matrix obtained from the dependency graph, where A_{ij} = 1 indicates that there is an edge between words i and j in the dependency graph. Figure 2 offers an example of the matrix produced by the corresponding dependency syntax tree.
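A single GCN layer over the dependency adjacency matrix can be sketched as below. This is a minimal NumPy sketch of Equation 1; adding self-loops to A so that each word keeps its own features is a common choice but our assumption here:

```python
import numpy as np

def gcn_layer(H, A, W, b):
    """One GCN layer: H' = ReLU(A @ H @ W + b) (Equation 1)."""
    return np.maximum(A @ H @ W + b, 0.0)

# Toy sentence of 3 words with dependency edges (0,1) and (1,2).
N, d_h = 3, 4
A = np.zeros((N, N))
for i, j in [(0, 1), (1, 2)]:
    A[i, j] = A[j, i] = 1.0
A += np.eye(N)  # self-loops (assumption): each word keeps its own features

rng = np.random.default_rng(0)
H = rng.normal(size=(N, d_h))
W = rng.normal(size=(d_h, d_h))
b = np.zeros(d_h)
H1 = gcn_layer(H, A, W, b)
print(H1.shape)  # (3, 4)
```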
In fact, A can be considered as a form of hard attention in GCN, while AGGCN (Guo et al., 2019) aims to improve the method by using A in the lower layers and updating A at the higher layers via multi-head self-attention (Vaswani et al., 2017) as below:

Ã^t = softmax((H^t W^t_Q)(H^t W^t_K)^T / sqrt(d_head)),   (2)

where W^t_Q and W^t_K are used to project the input H^t ∈ R^{N×d_head} (d_head = d_h / N_head) of the t-th head into a query and a key, and Ã^t ∈ R^{N×N} is the updated adjacency matrix for the t-th head.
For each head t, AGGCN uses Ã^t and a densely connected layer to update the word representations, similar to the standard GCN shown in Equation 1. The output of the densely connected layer is H̃^t ∈ R^{N×d_h}. Then a linear combination layer is used to merge the outputs of all heads:

H̃ = W_comb [H̃^1; H̃^2; ...; H̃^{N_head}].

After that, H̃ is concatenated with the original word representations H to form the final word representations.
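The attention-based adjacency update for one head can be sketched as follows. This is hypothetical NumPy code, not the AGGCN implementation; the resulting matrix is row-stochastic, so it acts as a soft adjacency:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_adjacency(H_t, W_Q, W_K):
    """Soft adjacency for one head: softmax(Q K^T / sqrt(d_head)),
    with Q = H_t @ W_Q and K = H_t @ W_K."""
    d_head = W_Q.shape[1]
    Q, K = H_t @ W_Q, H_t @ W_K
    return softmax(Q @ K.T / np.sqrt(d_head))

rng = np.random.default_rng(1)
N, d_head = 4, 8
A_t = attention_adjacency(rng.normal(size=(N, d_head)),
                          rng.normal(size=(d_head, d_head)),
                          rng.normal(size=(d_head, d_head)))
print(A_t.shape)  # (4, 4); each row sums to 1
```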

Span Representation
We employ span enumeration (Luan et al., 2019) to generate text spans. Taking the sentence "The mitral valve leaflets are mildly thickened" in Figure 2 as an example, the generated text spans will be "The", "The mitral", "The mitral valve", ..., "mildly", "mildly thickened" and "thickened". To represent a text span, we use the concatenation of the word representations of its start and end points. Concretely, given the word representations and a span (i, j) that starts at position i and ends at position j, the span representation is

s_{i,j} = [h_i; h_j; w_{j−i+1}],

where w is a 20-dimensional embedding representing the span width, following previous work (Luan et al., 2019; Wadden et al., 2019). Thus, the dimension of the span representation is twice the word representation dimension plus 20.
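Span enumeration and the representation above can be sketched as below. This is a toy sketch; the optional max_width cutoff is our assumption (practical systems often limit span width), and the zero-initialized width table stands in for a learned embedding:

```python
import numpy as np

def span_representations(H, width_emb, max_width=None):
    """Enumerate spans (i, j) and build s_{i,j} = [h_i; h_j; w_{j-i+1}].

    H: (N, d_h) word representations; width_emb: width embedding table,
    where row 0 corresponds to width 1. Returns a dict span -> vector.
    """
    N = H.shape[0]
    spans = {}
    for i in range(N):
        for j in range(i, N):
            if max_width is not None and j - i + 1 > max_width:
                break
            spans[(i, j)] = np.concatenate([H[i], H[j], width_emb[j - i]])
    return spans

N, d_h = 3, 4
H = np.arange(N * d_h).reshape(N, d_h).astype(float)
width_emb = np.zeros((N, 20))  # stand-in for a learned 20-dim embedding
spans = span_representations(H, width_emb)
print(len(spans), spans[(0, 1)].shape)  # 6 spans for 3 words; dim 2*4+20 = 28
```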

Decoding
Our decoding consists of two parts. First, we recognize all valid entity fragments, and then we perform pairwise classifications over the fragments to uncover their relationships.

Entity Fragment Recognition: Given a span (i, j) represented as s_{i,j}, we utilize one MLP to classify whether the span is an entity fragment and what the entity type is, formalized as

p_1 = softmax(MLP_ent(s_{i,j})),

where p_1 indicates the probabilities of entity types such as Organization, Disease and None (i.e., not an entity fragment).

Fragment Relation Prediction: Given two entity fragments (i, j) and (ĩ, j̃) represented as s_{i,j} and s_{ĩ,j̃}, we utilize another MLP to classify their relations:

p_2 = softmax(MLP_rel([s_{i,j}; s_{ĩ,j̃}])).

Noticeably, although the overlapped entities can be recognized at the first step, here we use the Overlapping relation as one auxiliary strategy to further enhance the model.

Algorithm 1: Decoding
1: R ← ∅, V ← ∅, E ← ∅
2: for each span (i, j) in the sentence do
3:   if ISENTITYFRAGMENT(s_{i,j}) then
4:     V ← s_{i,j}
5: for each pair of fragments s_{i,j}, s_{ĩ,j̃} in V do
6:   if ISSUCCESSION(s_{i,j}, s_{ĩ,j̃}) then
7:     E ← <s_{i,j}, s_{ĩ,j̃}>
8: Graph G = {V, E}
9: for g in FINDCOMPLETESUBGRAPHS(G) do
10:   R ← g
11: return R
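The two classifiers can be sketched with toy MLPs as follows. The label sets, layer sizes and random weights below are illustrative assumptions, not the trained model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mlp(x, W1, b1, W2, b2):
    """Two-layer MLP with a ReLU hidden layer."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

ENT_TYPES = ["None", "Disease", "Organization"]       # illustrative labels
REL_TYPES = ["Other", "Succession", "Overlapping"]

rng = np.random.default_rng(2)
d_s, d_hid = 28, 16
s1, s2 = rng.normal(size=d_s), rng.normal(size=d_s)   # two span vectors

# p1: entity-type distribution for one span.
p1 = softmax(mlp(s1,
                 rng.normal(size=(d_s, d_hid)), np.zeros(d_hid),
                 rng.normal(size=(d_hid, len(ENT_TYPES))), np.zeros(len(ENT_TYPES))))
# p2: relation distribution for a pair, from the concatenated representations.
pair = np.concatenate([s1, s2])
p2 = softmax(mlp(pair,
                 rng.normal(size=(2 * d_s, d_hid)), np.zeros(d_hid),
                 rng.normal(size=(d_hid, len(REL_TYPES))), np.zeros(len(REL_TYPES))))
print(p1.shape, p2.shape)  # (3,) (3,)
```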
During decoding (Algorithm 1), our model recognizes entity fragments from the text spans of the input sentence (lines 2-4) and examines each pair of these fragments to determine their relations (lines 5-7). The prediction results can therefore be considered as an entity fragment relation graph (line 8), where a node denotes an entity fragment and an edge denotes the relation between two entity fragments. The decoding objective is to find all the subgraphs in which each node connects with every other node (line 9). Each such subgraph composes an entity (line 10). In particular, an entity fragment that has no edge to others composes an entity by itself.

Table 1: Statistics of all the datasets used in this paper. The statistics of CLEF-Dis are sentence numbers; the others are document numbers.
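The clique-finding step of the decoding procedure can be sketched as follows. This is a minimal Bron–Kerbosch implementation under our assumptions: FINDCOMPLETESUBGRAPHS is taken to mean maximal cliques, and the fragment indices are toy values:

```python
def decode_entities(fragments, succession_pairs):
    """Compose entities from fragments linked by Succession edges.

    Each maximal complete subgraph (clique) of the fragment graph is
    one entity; an isolated fragment is an entity by itself.
    """
    adj = {f: set() for f in fragments}
    for a, b in succession_pairs:
        adj[a].add(b)
        adj[b].add(a)

    cliques = []
    def bron_kerbosch(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            bron_kerbosch(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)
    bron_kerbosch(set(), set(fragments), set())
    return sorted(tuple(sorted(c)) for c in cliques)

# Three fragments of one discontinuous entity (fully connected by
# Succession edges) plus an unrelated single-fragment entity.
frags = [(1, 1), (3, 3), (6, 6), (8, 9)]
succ = [((1, 1), (3, 3)), ((1, 1), (6, 6)), ((3, 3), (6, 6))]
print(decode_entities(frags, succ))
# [((1, 1), (3, 3), (6, 6)), ((8, 9),)]
```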

Training
During training, we employ multi-task learning (Caruana, 1997; Liu et al., 2017) to jointly train the different parts of our model. The loss function is defined as the negative log-likelihood of the two classification tasks, namely entity fragment recognition and fragment relation prediction:

L = −α Σ log p_1(y_ent) − β Σ log p_2(y_rel),

where y_ent and y_rel denote the corresponding gold-standard labels for text spans and span pairs, and α and β are weights to control the task importance. During training, we use the BertAdam algorithm (Devlin et al., 2019) with a learning rate of 5 × 10^-5 to finetune BERT and 1 × 10^-3 for the other parts of our model.

Datasets

2014). Concretely, they used the training set and test set of the ShARe/CLEF eHealth Evaluation Lab 2013 as the training and development sets, and the development set of the SemEval 2014 Task 7 as the test set. In addition, they selected only the sentences that contain at least one discontinuous entity. Finally, the training, development and test sets contain 534, 303 and 430 sentences, respectively. We call this dataset CLEF-Dis in this paper. Moreover, we also follow Dai et al. (2020) to evaluate models using the CADEC dataset proposed by Karimi et al. (2015), adopting their setting to split the dataset and conduct experiments.
To show that our model is comparable with the state-of-the-art models for overlapped NER, we conduct experiments on GENIA (Kim et al., 2003) and ACE05. For the GENIA and ACE05 datasets, we employ the same experimental settings as previous work (Lu and Roth, 2015; Muis and Lu, 2017; Luan et al., 2019), where 80%, 10% and 10% of the sentences in the 1,999 GENIA documents, and the sentences in 370, 43 and 51 ACE05 documents, are used for training, development and test, respectively. The statistics of all the datasets used in this paper are shown in Table 1.
Evaluation Metrics: In terms of evaluation metrics, we follow prior work (Lu and Roth, 2015; Muis and Lu, 2016; Lu, 2018, 2019) and employ precision (P), recall (R) and F1-score (F1). A predicted entity is counted as a true positive if its boundary and type match those of a gold entity. For a discontinuous entity, each span should match a span of the gold entity. All F1 scores reported in Section 5 are the mean values from five runs of the same setting.

Implementation Details: For hyper-parameters and other details, please refer to Appendix D.

Results on CLEF

Table 2 shows the results on the CLEF dataset. As seen, Tang et al. (2013) and Tang et al. (2015) adapted the CRF model, which is usually used for flat NER, to overlapped and discontinuous NER. They modified the BIO label scheme to BIOHD and BIOHD1234, which use "H" to label overlapped entity segments and "D" to label discontinuous entity segments. Surprisingly, the recently-proposed transition-based model (Dai et al., 2020) does not perform better than the CRF model (Tang et al., 2015), which may be because Tang et al. (2015) conducted elaborate feature engineering for their model. In contrast, our model outperforms all the strong baselines by a margin of at least about 5% in F1. Our model does not rely on feature engineering or manually-designed transitions, which makes it more suitable for modern end-to-end learning.

We further perform ablation studies to investigate the effects of the dependency-guided GCN and the Overlapping relation, which can be removed without influencing our major goal. As shown in Table 2, after removing either of them, the F1 scores decrease. To fairly compare with Wang and Lu (2019), we also replace BERT with the word embeddings pretrained on PubMed (Chiu et al., 2016). As we can see, our model also outperforms their model by 0.3%.
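The matching criterion described under Evaluation Metrics can be sketched as below. This is a simplified check under our own representation, where an entity is given as a type plus its set of fragment spans:

```python
def entities_match(pred, gold):
    """True-positive check: a predicted entity counts as correct only if
    its type matches and its set of spans exactly equals the gold
    entity's spans (every fragment boundary must match)."""
    pred_type, pred_spans = pred
    gold_type, gold_spans = gold
    return pred_type == gold_type and set(pred_spans) == set(gold_spans)

gold = ("Disorder", [(1, 2), (6, 6)])  # a discontinuous gold entity
print(entities_match(("Disorder", [(1, 2), (6, 6)]), gold))  # True
print(entities_match(("Disorder", [(1, 2)]), gold))          # False: fragment missing
```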

Results on CADEC
As shown in Table 4, Metke-Jimenez and Karimi (2016) employed a similar method to that of Tang et al. (2013), expanding the BIO label scheme to BIOHD. Tang et al. (2018) also experimented with the BIOHD label scheme, but they found that the result of the BIOHD-based method was slightly worse than that of the "Multilabel" method (65.5% vs. 66.3% in F1). Compared with the method of Metke-Jimenez and Karimi (2016), the performance improvement might be mainly because they used deep neural networks (e.g., LSTMs) instead of shallow non-neural models. Among the previous models, the transition-based model (Dai et al., 2020) is still the best. Our full model slightly outperforms the transition-based model by 0.5%. On this dataset, we do not observe mutual benefit between the dependency-guided GCN and overlapping relation prediction modules, since our model achieves better results when using them separately (69.9%) than when using them jointly (69.5%). However, when using them separately, the F1 is still 0.6% higher than when using neither of them. Without BERT, the performance of our model drops by about 3%, but it is still comparable with the performance of the methods without contextualized representations.

Result Analysis based on Entity Types
Comparing with BiLSTM-CRF: To show the necessity of building one model to recognize regular, overlapped and discontinuous entities simultaneously, we analyze the predicted entities in the CLEF-Dis dataset and classify them based on their types, as shown in Figure 4. For regular entities, the BiLSTM-CRF model can achieve a better performance than our model; in particular, its precision is much higher. One likely reason is that the BiLSTM-CRF model is capable of using label dependence to detect entity boundaries accurately, ensuring the correctness of the recognized entities, which is closely related to precision. Nevertheless, our model leads to higher recall, which reduces the gap between the two models. If both regular and overlapped entities are considered, the recall of our model is greatly boosted, and thus the F1 increases concurrently. If both regular and discontinuous entities are included, the performance of our model rises significantly to 50.9% due to the large proportion of discontinuous entities. When all types of entities are concerned, the F1 of our model further increases by 0.8%, indicating the effectiveness of our model in jointly recognizing overlapped, discontinuous and regular entities.
Comparing with the Transition-Based Model: As shown in Figure 5, we also compare our model with the transition-based model (Dai et al., 2020) based on entity types, by analyzing the results from one run of experiments. Note that since we do not tune the hyper-parameters of the transition-based model elaborately, its performance is not as good as the one reported by its authors. As seen, our model performs better in all four groups, namely regular, regular+overlapped, regular+discontinuous, and regular+overlapped+discontinuous entity recognition. However, based on the observation of the bars in different groups, we find that the main superiority of our model comes from regular entity recognition. In recognizing overlapped entities, our model is comparable with the transition-based model, but in recognizing discontinuous entities, our model performs slightly worse than the transition-based model. This suggests that a combination of span-based and transition-based models may be a potential direction for future research.

Results on GENIA and ACE05

Table 5 shows the results on the GENIA and ACE05 datasets, which include only regular and overlapped entities. Our final model achieves 77.8% and 83.0% F1 on the GENIA and ACE05 datasets, respectively. By removing the dependency-guided GCN, the model shows an average decrease of 0.4%, indicating the usefulness of dependency syntax information. This finding is consistent with that on the CLEF dataset. Interestingly, we note that the Overlapping relation also brings a positive influence in this setting. Actually, the relation extraction architecture is not necessary for only regular and overlapped entities, because decoding can be finished after the first entity fragment recognition step. This observation doubly demonstrates the advantage of our final model. We also compare our results with several state-of-the-art results of previous work on the two datasets in Table 5.

Conclusion
In this work, we proposed an efficient and effective model to recognize both overlapped and discontinuous entities simultaneously. In theory, it can be applied to any NER dataset, since no extra assumption is required to limit the types of named entities. First, we enumerate all spans in a given sentence to determine whether they are valid entity fragments, and then relation classification is performed to check the relationships between all fragment pairs. The results show that our model is highly competitive with the state-of-the-art models for overlapped or discontinuous NER. We have conducted detailed studies to help comprehensive understanding of our model.

A Word Representations

• If ELMo is used, each word x_i will first be split into characters and then input into character-level convolutional networks to obtain character-level word representations. Finally, all word representations in the sentence will be input into 3-layer BiLSTMs to generate contextualized word representations, which can also be denoted as H = {h_1, h_2, ..., h_N}.

• If BERT is used, each word x_i will be converted into word pieces and then fed into a pretrained BERT module. After the BERT calculation, each sentential word may involve vectorial representations of several pieces. Here we employ the representation of the beginning word piece as the final word representation, following Wadden et al. (2019). For instance, if "fevers" is split into "fever" and "##s", the representation of "fever" is used as the whole word representation. Therefore, all the words in the sentence x can also be represented as a matrix H = {h_1, h_2, ..., h_N}.

In addition, a bidirectional LSTM (BiLSTM) layer can be stacked on the word encoders to further capture contextual information in the sentence, which is especially helpful for non-contextualized word representations such as Word2Vec. Concretely, the word representations H = {h_1, h_2, ..., h_N} are input into the BiLSTM layer and consumed in the forward and backward orders. The outputs of the forward and backward LSTMs are concatenated to compose the final word representations Ĥ = {ĥ_1, ĥ_2, ..., ĥ_N}.
We investigate the effects of different word encoders and the BiLSTM layer in the experiments. As shown in Table 6, we compare the effects of different word representation methods on the CLEF and CLEF-Dis datasets, where the former is much larger than the latter, in order to also investigate the impact of data size on word representations. From the table, the first observation is that BERT is the most effective word representation method. Surprisingly, Word2Vec is more effective than ELMo, which may be because ELMo is exclusively based on characters and cannot effectively capture the whole meanings of words. This suggests that it is better to use ELMo together with Word2Vec.
Second, we find that BiLSTM is helpful in all cases, especially for Word2Vec. This may be because Word2Vec is a kind of non-contextualized word representation, which particularly needs the help of BiLSTM to capture contextual information. In contrast, BERT is not as sensitive to the help of BiLSTM as Word2Vec and ELMo, which may be because the transformers in BERT have already captured contextual information.
Third, we observe that the effect of BiLSTM is more obvious on the CLEF-Dis dataset. Considering the data sizes of the CLEF and CLEF-Dis datasets, it is likely that small datasets need the help of BiLSTM, while large datasets are less sensitive to it, and BERT is usually enough for them to build word representations.

Table 8: Effect of joint training between entity fragment recognition (EFR) and fragment relation prediction (FRP) on the CLEF-Dis dataset. P, R and F1 are the results for EFR.

B Case Studies
To understand how syntax information helps our model identify discontinuous or overlapped entities, we offer two examples from the CLEF dataset for illustration, as shown in Table 7. Both examples fail in the model without dependency information, but are correctly recognized by our final model. In the first example, the fragments "displaced" and "fracture" of the same entity are far away from each other in the original sentence, while they are directly connected in the dependency graph. Similarly, in the second example, the distance between "Tone" and "decreased" is 9 in the sentence, while their dependency distance is only 1. Such dependency connections can be directly modeled by the dependency-guided GCN, resulting in strong clues for NER, which is why our final model works.

C Effect of Joint Training
As mentioned in Section 3.5, we employ multi-task learning to jointly train our model on two tasks, namely entity fragment recognition and fragment relation prediction. Therefore, it is interesting to show the effect of joint training by observing the performance changes of the entity fragment recognition (EFR) task before and after adding the fragment relation prediction (FRP) task. As seen in Table 8, the F1 of entity fragment recognition increases by 0.3% after adding the FRP task, which shows that the FRP task can improve the EFR task. This suggests that the interaction between entity fragment recognition and fragment relation prediction benefits our model, which also indicates that end-to-end modeling is more desirable.

Table 9: Main hyper-parameter settings in our model for all the datasets (CLEF, CADEC, GENIA, ACE05). d_h: Section 3.1; N_head, l and d_f: Section 3.2; d_s: Section 3.3; α and β: Section 3.5. Note that the hyper-parameter settings for the CLEF-Dis dataset are the same as those for the CLEF dataset.

D Implementation Details
Our model is implemented based on AllenNLP (Gardner et al., 2018). The number of parameters is about 117M plus BERT. We use one NVIDIA Tesla V100 GPU to train the model, which occupies about 10GB of memory. The training time for one epoch is between 2 and 6 minutes on different datasets. Table 9 shows the main hyper-parameter values in our model. We tune the hyper-parameters based on the results of about 5 trials on the development sets. Below are the ranges tried for the hyper-parameters: the GCN layer l (1, 2), the GCN head N_head (2, 4), the GCN output size d_f (20, 48, 64), the MLP layer (1, 2), the MLP size (100, 150, 200), and the loss weights α and β (0.6, 0.8, 1.0). Since we employ BERT_BASE, the dimension d_h of word representations is 768, except on the CLEF and CADEC datasets, where we use a BiLSTM layer on top of BERT to obtain word representations, since we observe performance improvements. We try 200 and 400 hidden units for the BiLSTM layer.
Considering the domains of the datasets, we employ Clinical BERT (Alsentzer et al., 2019), SciBERT (Beltagy et al., 2019) and Google BERT (Devlin et al., 2019) for the CLEF (and CADEC), GENIA and ACE05 datasets, respectively. In addition, since our model needs syntax information for the dependency-guided GCN, but the datasets do not contain gold syntax annotations, we utilize the Stanford CoreNLP toolkit (Manning et al., 2014) to perform dependency parsing.