Dependency Parsing as MRC-based Span-Span Prediction

Higher-order methods for dependency parsing can partially but not fully address the issue that edges in dependency trees should be constructed at the text span/subtree level rather than the word level. In this paper, we propose a new method for dependency parsing to address this issue. The proposed method constructs dependency trees by directly modeling span-span (in other words, subtree-subtree) relations. It consists of two modules: the text span proposal module, which proposes candidate text spans, each of which represents a subtree in the dependency tree denoted by (root, start, end); and the span linking module, which constructs links between proposed spans. We use the machine reading comprehension (MRC) framework as the backbone to formalize the span linking module, where one span is used as a query to extract the text span/subtree it should be linked to. The proposed method has the following merits: (1) it addresses the fundamental problem that edges in a dependency tree should be constructed between subtrees; (2) the MRC framework allows the method to retrieve spans missed in the span proposal stage, which leads to higher recall for eligible spans. Extensive experiments on the PTB, CTB and Universal Dependencies (UD) benchmarks demonstrate the effectiveness of the proposed method. The code is available at https://github.com/ShannonAI/mrc-for-dependency-parsing


Introduction
Dependency parsing is a basic and fundamental task in natural language processing (NLP) (Eisner, 2000; Nivre, 2003; McDonald et al., 2005b). Among existing efforts for dependency parsers, graph-based models (McDonald et al., 2005a; Pei et al., 2015) are a widely used category of models, which cast the task as finding the optimal maximum spanning tree in a directed graph. Graph-based models provide a more global view than shift-reduce models (Zhang and Nivre, 2011; Chen and Manning, 2014), leading to better performance.

Figure 1: Two possible dependency trees for the sentence "I love Tim's cat". For the tree on the right-hand side, the link from "love" to "cat" is correct at the token-token level, but incorrect at the subtree-subtree level, since the span/subtree behind "cat" is incorrect.

Graph-based methods face a challenge: they construct dependency edges using word pairs as the basic units for modeling, which is insufficient because dependency parsing operates at the span/subtree level. For example, Figure 1 shows two possible dependency trees for the sentence "I love Tim's cat". In both cases, at the token level, "love" is linked to "cat". If we only consider token-token relations, the second case of "love" being linked to "cat" is correct. But if we view the tree at the subtree-subtree level, the linking is incorrect, since the span/subtree behind "cat" is incorrect. Although higher-order methods are able to alleviate this issue by aggregating information across adjacent edges, they cannot fully address it. In essence, the token-token strategy can be viewed as a coarse simplification of the span-span (subtree-subtree) strategy, where the root token in the token-token strategy can be viewed as the average of all spans covering it. We would like an approach that directly models span-span relations using the exact subtrees behind tokens, rather than the average of all spans covering them.
To address this challenge, in this work, we propose a model for dependency parsing that directly operates at the span-span relation level. The proposed model consists of two modules: (1) the text span proposal module, which proposes eligible candidate text spans, each of which represents a subtree in the dependency tree denoted by (root, start, end); and (2) the span linking module, which constructs links between proposed spans to form the final dependency tree. We use the machine reading comprehension (MRC) framework as the backbone to formalize the span linking module in an MRC setup, where one span is used as a query to extract the text span/subtree it should be linked to. In this way, the proposed model is able to directly model span-span relations and build the complete dependency tree in a bottom-up recursive manner.
The proposed model provides benefits in three aspects: (1) it naturally addresses the shortcoming of token-token modeling in vanilla graph-based approaches and directly operates at the span level; (2) with the MRC framework, spans left out at the span proposal stage can still be retrieved at the span linking stage, so the negative effect of unextracted spans is alleviated; and (3) the MRC formalization allows us to take advantage of existing state-of-the-art MRC models, enhancing model expressivity and leading to better performance.
We are able to achieve new SOTA performances on PTB, CTB and UD benchmarks, which demonstrate the effectiveness of the proposed method.

Related Work
Transition-based dependency parsing incrementally constructs a dependency tree from input words through a sequence of shift-reduce actions (Zhang and Nivre, 2011;Chen and Manning, 2014;Zhou et al., 2015;Dyer et al., 2015;Yuan et al., 2019;Han et al., 2019;Mohammadshahi and Henderson, 2020a). Graph-based dependency parsing searches through the space of all possible dependency trees for a tree that maximizes a specific score (Pei et al., 2015;Wang et al., 2018;Zhang et al., 2019).
Graph-based dependency parsing was first introduced by McDonald et al. (2005a,b). They formalized the task of dependency parsing as finding the maximum spanning tree (MST) in directed graphs and used the large margin objective (Crammer et al., 2006) to efficiently train the model. Zhang et al. (2016) introduced a probabilistic convolutional neural network (CNN) for graph-based dependency parsing to model third-order dependency information. Wang and Chang (2016) and Kiperwasser and Goldberg (2016) proposed to employ LSTMs as an encoder to extract features, which are then used to score dependencies between words. Zhang and Zhao (2015); Zhang et al. (2019); Wang and Tu (2020) integrated higher-order features across adjacent dependency edges to build the dependency tree. Ji et al. (2019) captured high-order dependency information by using graph neural networks. The biaffine approach (Dozat and Manning, 2016) is a particular kind of graph-based method improving upon vanilla scoring functions in graph-based dependency parsing. Ma et al. (2018) combined biaffine classifiers and pointer networks to build dependency trees in a top-down manner. Jia et al. (2020); Zhang et al. (2020) extended the biaffine approach to the conditional random field (CRF) framework. Mrini et al. (2020) incorporated label information into the self-attention structure (Vaswani et al., 2017b) for biaffine dependency parsing.

Notations
Given a sequence of input tokens $s = (w_0, w_1, ..., w_n)$, where $n$ denotes the length of the sentence and $w_0$ is a dummy token representing the root of the sentence, we formalize the task of dependency parsing as finding the tree with the highest score among all possible trees rooted at $w_0$:

$$\hat{T} = \arg\max_{T} \text{score}(T_{w_0}) \quad (1)$$
Each token $w_i$ in the input sentence corresponds to a subtree $T_{w_i}$ rooted at $w_i$ within the full tree $T$. The subtree can be characterized by a text span, with the index of its leftmost token in the original sequence being $T_{w_i}.s$ and the index of its rightmost token being $T_{w_i}.e$. As shown in the first example of Figure 1, the span covered by the subtree $T_{\text{love}}$ is the full sentence "I love Tim's cat", and the span covered by the subtree $T_{\text{cat}}$ is "Tim's cat". Each directional arc $w_i \to w_j$ in $T$ represents a parent-child relation between $T_{w_i}$ and $T_{w_j}$, where $T_{w_j}$ is a subtree of $T_{w_i}$. This implies that the text span covered by $T_{w_j}$ is fully contained in the text span covered by $T_{w_i}$. It is worth noting that the proposed paradigm can only handle the projective situation; we return to how to adjust it to the non-projective situation in Section 3.5.
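To make the (root, start, end) characterization concrete, the following is a minimal sketch (not part of the paper's implementation) that recovers each token's subtree span from a head-index array; for a projective tree, every subtree covers a contiguous span. The head array for the left tree of Figure 1 is an assumption based on standard PTB-style possessive attachment.

```python
def subtree_spans(heads):
    """Compute the (start, end) text span covered by each token's subtree.

    heads[i] is the index of token i's head; the dummy root w0 has head -1.
    Assumes a projective tree, so every subtree covers a contiguous span.
    """
    n = len(heads)
    children = [[] for _ in range(n)]
    for i, h in enumerate(heads):
        if h >= 0:
            children[h].append(i)

    spans = {}

    def visit(i):
        # A subtree's span is the min/max over the token itself and
        # the spans of all its children.
        start = end = i
        for c in children[i]:
            cs, ce = visit(c)
            start, end = min(start, cs), max(end, ce)
        spans[i] = (start, end)
        return start, end

    visit(0)
    return spans

# Left tree of Figure 1 for "<root> I love Tim 's cat" (attachment assumed):
heads = [-1, 2, 0, 5, 3, 2]
spans = subtree_spans(heads)
# spans[2] (T_love) covers (1, 5): "I love Tim 's cat"
# spans[5] (T_cat) covers (3, 5): "Tim 's cat"
```

Each entry of `spans` is exactly the (start, end) pair written $T_{w_i}.s$, $T_{w_i}.e$ in the notation above.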

Scoring Function
With the notations defined in the previous section, we now illustrate how to compute $\text{score}(T_{w_0})$ in Eq. (1). Since we want to model the span-span relations inside a dependency tree, where the tree is composed of spans and the links between them, we formalize the scoring function as:

$$\text{score}(T_{w_0}) = \lambda \sum_{i=1}^{n} \text{score}_{\text{span}}(T_{w_i}) + \sum_{(w_i \to w_j) \in T_{w_0}} \text{score}_{\text{link}}(T_{w_i}, T_{w_j}) \quad (2)$$

where $\text{score}_{\text{span}}(T_{w_i})$ represents how likely the subtree rooted at $w_i$ covers the text span from $T_{w_i}.s$ to $T_{w_i}.e$, $\text{score}_{\text{link}}(T_{w_i}, T_{w_j})$ represents how likely $T_{w_j}$ is a subtree of $T_{w_i}$, i.e., there is an arc from $w_i$ to $w_j$, and $\lambda$ is a hyper-parameter balancing $\text{score}_{\text{span}}$ and $\text{score}_{\text{link}}$. We illustrate the details of how to compute $\text{score}_{\text{span}}(T)$ and $\text{score}_{\text{link}}(T_1, T_2)$ in the following sections. Table 1 shows all the spans and links for the left tree in Figure 1.

Span Proposal Module
In this section, we introduce the span proposal module. This module assigns each tree $T_{w_i}$ a score $\text{score}_{\text{span}}(T_{w_i})$ in Eq. (2), which represents how likely the subtree rooted at $w_i$ covers the text span from $T_{w_i}.s$ to $T_{w_i}.e$. The score can be decomposed into two components -- the score for the left half span from $w_i$ to $T_{w_i}.s$, and the score for the right half span from $w_i$ to $T_{w_i}.e$:

$$\text{score}_{\text{span}}(T_{w_i}) = \text{score}_{\text{start}}(T_{w_i}.s \mid w_i) + \text{score}_{\text{end}}(T_{w_i}.e \mid w_i) \quad (3)$$

We propose to formalize $\text{score}_{\text{start}}(T_{w_i}.s \mid w_i)$ as the score for the text span starting at $T_{w_i}.s$ and ending at $w_i$, transforming the task into a text span extraction problem. Concretely, we use the biaffine function (Dozat and Manning, 2016) to score the text span by computing $\text{score}_{\text{start}}(j \mid i)$ -- the score of the tree rooted at $w_i$ and starting at $w_j$:

$$\text{score}_{\text{start}}(j \mid i) = x_i^\top U x_j + w^\top x_j \quad (4)$$

where $U \in \mathbb{R}^{d \times d}$ and $w \in \mathbb{R}^d$ are trainable parameters, and $x_i \in \mathbb{R}^d$ and $x_j \in \mathbb{R}^d$ are token representations of $w_i$ and $w_j$ respectively. To obtain $x_i$ and $x_j$, we pass the sentence $s$ to a pretrained model such as BERT (Devlin et al., 2018); $x_i$ and $x_j$ are the last-layer representations output from BERT for $w_i$ and $w_j$. We use the following loss to optimize the left-half span proposal module:

$$\mathcal{L}_{\text{start}} = -\sum_{i=1}^{n} \log \frac{\exp(\text{score}_{\text{start}}(T_{w_i}.s \mid i))}{\sum_{j} \exp(\text{score}_{\text{start}}(j \mid i))} \quad (5)$$

This objective enforces the model to find the correct span start $T_{w_i}.s$ for each word $w_i$. We ignore the loss for $w_0$, the dummy root token. $\text{score}_{\text{end}}(T_{w_i}.e \mid w_i)$ can be computed in a similar way, where the model extracts the text span rooted at $w_i$ and ending at $T_{w_i}.e$:

$$\text{score}_{\text{end}}(j \mid i) = x_i^\top U' x_j + w'^\top x_j \quad (6)$$

where $U' \in \mathbb{R}^{d \times d}$ and $w' \in \mathbb{R}^d$ are a separate set of trainable parameters. The loss to optimize the right-half span proposal module is:

$$\mathcal{L}_{\text{end}} = -\sum_{i=1}^{n} \log \frac{\exp(\text{score}_{\text{end}}(T_{w_i}.e \mid i))}{\sum_{j} \exp(\text{score}_{\text{end}}(j \mid i))} \quad (7)$$

Using the left-half span score in Eq. (4) and the right-half span score in Eq. (6) to compute the full span score in Eq. (3), we are able to compute the score for any subtree with text span starting at $T_{w_i}.s$, ending at $T_{w_i}.e$ and rooted at $w_i$.
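As a sketch of the biaffine left-boundary scorer, the following assumes one plausible form consistent with the stated parameter shapes ($U \in \mathbb{R}^{d \times d}$, $w \in \mathbb{R}^d$); the exact formulation in the trained model may differ. Here `X` stands in for the last-layer BERT representations.

```python
import numpy as np

def score_start(X, U, w, i):
    """Biaffine scores for every candidate left boundary j of the span
    rooted at token i:  score_start(j | i) = x_i^T U x_j + w^T x_j.

    X: (n, d) token representations (e.g. last-layer BERT outputs).
    Returns an (n,) vector of scores over all positions j; a softmax over
    these scores gives the distribution used by the left-half span loss.
    """
    # x_i^T U x_j for all j at once: project x_i through U, dot with every x_j.
    bilinear = X @ (U.T @ X[i])
    # Linear term w^T x_j for all j.
    linear = X @ w
    return bilinear + linear
```

The right-boundary score would use a second parameter set (`U'`, `w'`) with the same shape, mirroring the left-half computation.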

Span Linking Module
Given two subtrees $T_{w_i}$ and $T_{w_j}$, the span linking module assigns a score $\text{score}_{\text{link}}(T_{w_i}, T_{w_j})$ representing the probability of $T_{w_j}$ being a subtree of $T_{w_i}$. This means that $T_{w_i}$ is the parent of $T_{w_j}$, and that the span associated with $T_{w_j}$, i.e., $(T_{w_j}.s, T_{w_j}.e)$, is fully contained in the span associated with $T_{w_i}$, i.e., $(T_{w_i}.s, T_{w_i}.e)$.
We propose to use the machine reading comprehension framework as the backbone to compute this score. It operates on the triplet {context (X), query (q) and answer (a)}. The context X is the original sentence s. The query q is the child span (T w j .s, T w j .e). And we wish to extract the answer, which is the parent span (T w i .s, T w i .e) from the context input sentence s. The basic idea here is that using the child span to query the full sentence gives direct cues for identifying the corresponding parent span, and this is more effective than simply feeding two extracted spans and then determining whether they have the parent-child relation.

Constructing Query
Regarding the query, we should consider both the span and its root. The query is thus formalized as:

$$q = \{\texttt{<sos>},\, w_{T_{w_j}.s},\, ...,\, \texttt{<sor>},\, w_j,\, \texttt{<eor>},\, ...,\, w_{T_{w_j}.e},\, \texttt{<eos>}\} \quad (8)$$

where <sos>, <sor>, <eor>, and <eos> are special tokens, which respectively denote the start of span, the start of root, the end of root, and the end of span. One issue with this way of constructing the query is that the position information of $T_{w_j}$ is not included. In practice, we therefore turn to a more convenient strategy where the query is the original sentence, with the special tokens <sos>, <sor>, <eor>, and <eos> inserted to mark the position of the child. In this way, position information for the child $T_{w_j}$ is naturally considered.
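The second strategy above can be sketched as follows; this is an illustrative reconstruction (the exact tokenization and marker placement in the released code may differ). The function marks the child span and its root inside a copy of the full sentence.

```python
def build_query(tokens, child_start, child_end, child_root):
    """Mark the child span and its root in a copy of the sentence.

    Illustrates the query strategy where the query is the original sentence
    with <sos>/<eos> around the child span and <sor>/<eor> around its root
    token, so the child's position information is preserved.
    """
    q = []
    for t, tok in enumerate(tokens):
        if t == child_start:
            q.append("<sos>")
        if t == child_root:
            q.append("<sor>")
        q.append(tok)
        if t == child_root:
            q.append("<eor>")
        if t == child_end:
            q.append("<eos>")
    return q

# Child subtree "Tim 's cat" (span 2..4, root "cat") in "I love Tim 's cat":
# -> I love <sos> Tim 's <sor> cat <eor> <eos>
```

The resulting token sequence is then concatenated with the context as {<cls>, query, <sep>, context} for the MRC encoder.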
Answer Extraction The answer is the parent, with the span $(T_{w_i}.s, T_{w_i}.e)$ rooted at $w_i$. We can directly adopt the MRC framework by identifying the start and end of the answer span, whose scores are respectively denoted by $\text{score}^s_{\text{parent}}(T_{w_i}.s \mid T_{w_j})$ and $\text{score}^e_{\text{parent}}(T_{w_i}.e \mid T_{w_j})$. We also wish to identify the root $w_i$ of the answer, characterized by the score of $w_i$ being the root of the span, denoted by $\text{score}^r_{\text{parent}}(w_i \mid T_{w_j})$. Furthermore, since we also want to identify the relation category between the parent and the child, we add the score for the relation label $l$, denoted by $\text{score}^l_{\text{parent}}(l \mid T_{w_j}, w_i)$. For the quadruple $(T_{w_i}.s, T_{w_i}.e, w_i, l)$, which denotes the span $(T_{w_i}.s, T_{w_i}.e)$ rooted at $w_i$ with relation label $l$, the final score for it being the answer to $T_{w_j}$ is given by:

$$\text{score}_{\text{parent}}(T_{w_i} \mid T_{w_j}) = \text{score}^s_{\text{parent}}(T_{w_i}.s \mid T_{w_j}) + \text{score}^e_{\text{parent}}(T_{w_i}.e \mid T_{w_j}) + \text{score}^r_{\text{parent}}(w_i \mid T_{w_j}) + \text{score}^l_{\text{parent}}(l \mid T_{w_j}, w_i) \quad (9)$$

In the MRC setup, the input is the concatenation of the query and the context, denoted by {<cls>, query, <sep>, context}, where <cls> and <sep> are special tokens. The input is fed to BERT, and we obtain representations for each input token. Let $h_t$ denote the representation output from BERT for the token with index $t$. The probability of the $t$-th token being the root of the answer, denoted by $\text{score}^r_{\text{parent}}(w_t \mid T_{w_j})$, is a softmax over all constituent tokens in the context:

$$\text{score}^r_{\text{parent}}(w_t \mid T_{w_j}) = \frac{\exp(h_t^\top h_{\text{root}})}{\sum_{t'} \exp(h_{t'}^\top h_{\text{root}})} \quad (10)$$

where $h_{\text{root}}$ is a trainable parameter. $\text{score}^s_{\text{parent}}$ and $\text{score}^e_{\text{parent}}$ can be computed in a similar way:

$$\text{score}^s_{\text{parent}}(w_t \mid T_{w_j}) = \frac{\exp(h_t^\top h_{\text{start}})}{\sum_{t'} \exp(h_{t'}^\top h_{\text{start}})} \quad (11) \qquad \text{score}^e_{\text{parent}}(w_t \mid T_{w_j}) = \frac{\exp(h_t^\top h_{\text{end}})}{\sum_{t'} \exp(h_{t'}^\top h_{\text{end}})} \quad (12)$$

For $\text{score}^l_{\text{parent}}(l \mid T_{w_j}, w_i)$, which denotes the score of the relation label between $T_{w_i}$ and $T_{w_j}$, we can compute it in a simple way. Since $h_{w_i}$ already encodes information about $h_{w_j}$ through self-attention, the representation $h_{w_i}$ for $w_i$ is directly fed to a softmax over all labels in the label set $L$:

$$\text{score}^l_{\text{parent}}(l \mid T_{w_j}, w_i) = \frac{\exp(W_l\, h_{w_i})}{\sum_{l' \in L} \exp(W_{l'}\, h_{w_i})} \quad (13)$$

Mutual Dependency A closer look at Eq. (9) reveals that it only models the uni-directional dependency relation that $T_{w_i}$ is the parent of $T_{w_j}$.
This is suboptimal since if T w i is a parent answer of T w j , T w j should be a child answer of T w i . We thus propose to use T w i as the query q and T w j as the answer a.
The final score $\text{score}_{\text{link}}$ is thus given by:

$$\text{score}_{\text{link}}(T_{w_i}, T_{w_j}) = \text{score}_{\text{parent}}(T_{w_i} \mid T_{w_j}) + \text{score}_{\text{child}}(T_{w_j} \mid T_{w_i}) \quad (14)$$

Since one tree may have multiple children but can only have one parent, we use the cross entropy loss $\mathcal{L}_{\text{parent}}$ for $\text{score}_{\text{parent}}(T_{w_i} \mid T_{w_j})$ and the binary cross entropy loss $\mathcal{L}_{\text{child}}$ for $\text{score}_{\text{child}}(T_{w_j} \mid T_{w_i})$. We jointly optimize the two losses, $\mathcal{L}_{\text{link}} = \mathcal{L}_{\text{parent}} + \mathcal{L}_{\text{child}}$, for span linking.

Inference
Given an input sentence $s = (w_0, w_1, w_2, ..., w_n)$, the number of all possible subtree spans $(w_i, T_{w_i}.s, T_{w_i}.e)$ is $O(n^3)$, and therefore running the MRC procedure for every candidate span is computationally prohibitive. A naive solution is to use the span proposal module to extract the top-$k$ scored spans rooted at each token. This gives a set of span candidates $\mathcal{T}$ of size $1 + n \times k$ (the root token $w_0$ produces only one span), where each candidate span is associated with its subtree span score $\text{score}_{\text{span}}(\cdot)$. We then construct the optimal dependency tree based only on these extracted spans by linking them. This strategy obtains only a local optimum for Eq. (2): the optimal solution for the first part $\sum_{i=1}^{n} \text{score}_{\text{span}}(T_{w_i})$ depends on the second part of Eq. (2), i.e., $\sum_{(w_i \to w_j) \in T_{w_0}} \text{score}_{\text{link}}(T_{w_i}, T_{w_j})$, but in this naive strategy the second part is computed after the first part.
It is worth noting that the naive solution of using only the top-$k$ scored spans has another severe issue: spans left out at the span proposal stage can never be part of the final prediction, since the span linking module only operates on the proposed spans. This would not be a big issue if $k$ were large enough to recall almost every span in the ground truth. However, span proposal is intrinsically harder than span linking, because the span proposal module lacks the triplet span information that is used by the span linking module. Therefore, we propose to use the span linking module to retrieve more correct spans. Concretely, for every span $T_{w_j}$ proposed by the span proposal module, we use $\arg\max_{T_{w_i}} \text{score}_{\text{parent}}(T_{w_i} \mid T_{w_j})$ to retrieve its highest-scoring parent as an additional span candidate. Recall that span proposal produces $1 + n \times k$ spans; together with the spans proposed by the span linking module, the maximum number of candidate spans is $1 + 2 \times n \times k$. The MRC formalization behind the span linking module improves the recall rate, as spans missed at the span proposal stage can still be retrieved at this stage.
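The retrieval step above can be sketched as follows; `parent_score` is a stand-in for the MRC parent head (in practice it would run the span linking model), and spans are represented as (root, start, end) tuples per the paper's notation.

```python
def augment_candidates(proposed, parent_score):
    """Augment span-proposal candidates with spans retrieved by span linking.

    proposed: set of (root, start, end) spans from the proposal module
        (1 + n*k of them).
    parent_score: callable mapping a child span to a dict
        {candidate parent span: score}, standing in for the MRC parent head.

    For each proposed child we add its single best-scoring parent span,
    so the candidate set grows to at most 1 + 2*n*k spans.
    """
    augmented = set(proposed)
    for child in proposed:
        scores = parent_score(child)
        if scores:
            best_parent = max(scores, key=scores.get)
            augmented.add(best_parent)
    return augmented
```

A span missed by the proposal module thus re-enters the candidate set whenever it is the best parent of some proposed span, which is what raises span recall.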
Projective Decoding Given the retrieved spans harvested in the proposal stage, we use a CKY-style bottom-up dynamic programming algorithm to find the projective tree with the highest score based on Eq. (2). The key idea is that we can generalize the definition of $\text{score}(T_{w_0})$ in Eq. (2) to any token $w$:

$$\text{score}(T_w) = \lambda \sum_{T_{w_i} \subseteq T_w} \text{score}_{\text{span}}(T_{w_i}) + \sum_{(w_i \to w_j) \in T_w} \text{score}_{\text{link}}(T_{w_i}, T_{w_j}) \quad (15)$$

where $\{T_{w_i} \mid T_{w_i} \subseteq T_w\}$ is the set of all subtrees inside $T_w$, i.e., there is a path in $T_w$ of the form $w \to w_{i_1} \to ... \to w_i$. Using this definition, we can rewrite $\text{score}(T_w)$ in a recursive manner:

$$\text{score}(T_w) = \lambda\, \text{score}_{\text{span}}(T_w) + \sum_{T_{w_i} \in C(T_w)} \left[ \text{score}_{\text{link}}(T_w, T_{w_i}) + \text{score}(T_{w_i}) \right] \quad (16)$$

where $C(T_w)$ is the set of all direct subtrees of $T_w$. The full algorithm of projective decoding is presented in Algorithm 1.

Algorithm 1: Projective Inference
Input: input sentence $s$; span candidates $\mathcal{T}$; span scores $\text{score}_{\text{span}}(T)$, $\forall T \in \mathcal{T}$
Output: highest score of every span, $\text{score}(T)$, $\forall T \in \mathcal{T}$
/* Compute linking scores based on Eq. (14) */
$\text{score}_{\text{link}}(T_1, T_2) \leftarrow \text{score}_{\text{parent}}(T_1 \mid T_2) + \text{score}_{\text{child}}(T_2 \mid T_1)$, $\forall (T_1, T_2) \in \mathcal{T}$
/* Compute $\text{score}(T)$, $\forall T \in \mathcal{T}$, bottom-up over span lengths */
for len ← 0 to n do
    for T ∈ $\mathcal{T}$ do
        if T.e − T.s = len then
            /* T covers a single word */
            ...
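The recursive decomposition of $\text{score}(T_w)$ can be illustrated by the following sketch, which scores one fixed tree (given as a head array) from precomputed span and link scores. This is not the full dynamic program, which searches over all candidate spans; score tables are plain dicts standing in for the trained modules, and the dummy root's span score is set to zero in the example since the span sum runs over $i \geq 1$.

```python
def tree_score(heads, score_span, score_link, lam, w):
    """Recursive score of the subtree rooted at token w:

        lam * score_span(T_w)
        + sum over direct children c of [score_link(T_w, T_c) + score(T_c)]

    heads[i] is the head index of token i (dummy root w0 has head -1);
    score_span maps token index -> span score; score_link maps
    (parent, child) token pairs -> linking score; lam is the balance
    hyper-parameter from the scoring function.
    """
    children = [i for i, h in enumerate(heads) if h == w]
    total = lam * score_span[w]
    for c in children:
        total += score_link[(w, c)] + tree_score(
            heads, score_span, score_link, lam, c)
    return total
```

In the actual bottom-up algorithm, the same recursion is evaluated over span candidates in order of increasing span length, so each subtree's best score is available when its parent is processed.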
Non-Projective Decoding It is noteworthy that effectively finding a set of subtrees composing a tree $T$ requires trees to be projective (the projective property guarantees that every subtree corresponds to a contiguous span of text), and the experiments in Section 4 show that this algorithm performs well on datasets where most trees are projective, but worse when a substantial number of trees are non-projective. To address this issue, we adapt the proposed strategy to the MST (Maximum Spanning Tree) algorithm (McDonald et al., 2005b). The key point of MST is to obtain a score for each pair of tokens $w_i$ and $w_j$ (rather than spans), denoted by $\text{score}_{\text{edge}}(w_i, w_j)$. We propose that the score to link $w_i$ and $w_j$ is the highest score achieved by two spans respectively rooted at $w_i$ and $w_j$:

$$\text{score}_{\text{edge}}(w_i, w_j) = \max_{T_{w_i}, T_{w_j}} \text{score}_{\text{link}}(T_{w_i}, T_{w_j}) \quad (17)$$

The final score for a tree $T$ is given by:

$$\text{score}(T) = \sum_{(w_i \to w_j) \in T} \text{score}_{\text{edge}}(w_i, w_j) \quad (18)$$

With these edge scores, MST can be readily used for decoding.
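The span-to-edge reduction described above can be sketched as follows; the candidate lists and the link-score table are stand-ins for the span proposal and span linking modules, and spans are (root, start, end) tuples.

```python
def edge_scores(candidates, score_link):
    """Reduce span-level linking scores to token-level edge scores for
    MST decoding: score_edge(wi, wj) is the best linking score over the
    candidate spans rooted at wi and wj respectively.

    candidates: dict mapping root token index -> list of (start, end)
        spans proposed for that root.
    score_link: dict mapping ((i, si, ei), (j, sj, ej)) span pairs
        (parent span first) to linking scores.
    """
    edges = {}
    for i, spans_i in candidates.items():
        for j, spans_j in candidates.items():
            if i == j:
                continue
            edges[(i, j)] = max(
                score_link[((i, *si), (j, *sj))]
                for si in spans_i
                for sj in spans_j
            )
    return edges
```

The resulting dense edge-score matrix is exactly what the Chu-Liu/Edmonds MST decoder consumes, so non-projective trees can be recovered without the contiguous-span assumption.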

Datasets and Metrics
We carry out experiments on three widely used dependency parsing benchmarks: the English Penn Treebank v3.0 (PTB) dataset (Marcus et al., 1993), the Chinese Treebank v5.1 (CTB) dataset (Xue et al., 2002) and the Universal Dependency Treebanks v2.2 (UD) (Nivre et al., 2016), from which we select 12 languages for evaluation. We follow Ma et al. (2018) to process all datasets. The PTB dataset contains 39,832 sentences for training and 2,416 sentences for testing. The CTB dataset contains 16,091 sentences for training and 1,910 sentences for testing. The statistics for the 12 languages in the UD dataset are the same as in Ma et al. (2018). We use the unlabeled attachment score (UAS) and labeled attachment score (LAS) for evaluation. Punctuation is ignored in all datasets during evaluation.

Experiment Settings
We compare the proposed model to the following baselines: (1) Biaffine, (2) StackPTR, (3) GNN, (4) MP2O, (5) CVT, (6) LRPTR, (7) HiePTR, (8) TreeCRF, (9) HPSG, (10) HPSG+LA, (11) MulPTR, (12) SynTr. The details of these baselines are left to the supplementary material due to the page limit. We group experiments into three categories: without pretrained models, with BERT, and with RoBERTa. To implement a span-prediction parsing model without pretrained models, we use QAnet (Yu et al., 2018) for span prediction. To enable apples-to-apples comparisons, we implement our proposed model, the Biaffine model, and MP2O (Wang and Tu, 2020) based on BERT-large (Devlin et al., 2018) and RoBERTa-large for PTB, BERT and RoBERTa-wwm-large (Cui et al., 2019) for CTB, and BERT-base-multilingual-cased and XLM-RoBERTa-large for UD. We apply both projective decoding and non-projective MST decoding to all datasets. For all experiments, we concatenate a 100d POS tag embedding with the 1024d pretrained token embeddings, then project them to 1024d using a linear layer. Following Mrini et al. (2020), we further add 1-3 additional encoder layers on top to let the POS embeddings interact with the pretrained token embeddings. POS tags are predicted using the Stanford NLP package. We tried two different types of additional encoders: Bi-LSTM (Hochreiter and Schmidhuber, 1997) and Transformer (Vaswani et al., 2017a). For Bi-LSTM, the hidden size is 1024. For Transformer, the number of attention heads and the hidden size remain the same as in the pretrained models (16 attention heads, hidden size 1024). We use a dropout rate of 0.1 for pretrained models and 0.3 for the additional layers. We use Adam (Kingma and Ba, 2014) as the optimizer. The weight parameter $\lambda$ is tuned on the development set.

Main Results
Table 2 compares our model to existing state-of-the-art models on the PTB/CTB test sets. As can be seen, for models without a pretrained LM, the proposed span-prediction model based on QAnet outperforms all baselines, illustrating the effectiveness of the proposed span-prediction framework for dependency parsing. For BERT-based models, the proposed span-prediction models outperform the Biaffine model based on BERT, along with other competitive baselines. On PTB, they outperform all previous baselines except on the LAS metric in comparison to HiePTR (95.46 vs. 95.47), though they underperform RoBERTa-based models. On CTB, the proposed span-prediction model obtains a new SOTA performance of 93.14% UAS. For RoBERTa-based models, the proposed model achieves a new SOTA performance of 97.24% UAS and 95.49% LAS on PTB.

Table 2: Results for different models on PTB and CTB. Approaches that utilize both dependency and constituency information are not comparable to ours.

As PTB and CTB contain almost only projective trees, the projective decoding strategy significantly outperforms the non-projective MST algorithm. It is worth noting that, since MulPTR, HPSG and HPSG+LA rely on additional labeled data for constituency parsing, their results are not comparable to ours; we list them for reference purposes. Table 3 compares our model with existing state-of-the-art methods on the UD test sets. Other than es, where the proposed model slightly underperforms the SOTA model by 0.02, the proposed model enhanced with XLM-RoBERTa achieves SOTA performances on all other 11 languages, with an average performance boost of 0.3. As many languages in UD have a notable portion of non-projective trees, MST decoding significantly outperforms projective decoding, leading to new SOTA performances on almost all languages.

Ablation Study and Analysis
We use PTB to understand behaviors of the proposed model. As projective decoding works best for PTB, scores reported in this section are all from projective decoding.

Effect of Candidate Span Number
We would like to study the effect of the number of candidate spans proposed by the span proposal module, i.e., the value of $k$. We vary the value of $k$ from 1 to 25. As shown in Table 4, increasing values of $k$ lead to higher UAS, and the performance stops increasing once $k$ is large enough ($k > 15$). More interestingly, even when $k$ is set to 1, which means that only one candidate span is proposed for each word, the final UAS score is 96.94, very close to the best result of 97.24 and surpassing most existing methods in Table 2. These results verify that the proposed approach can accurately extract and link the dependency spans.

Effect of Span Retrieval by Span Linking
As shown in Table 5, span recall significantly improves with the presence of the span linking stage. This is in line with our expectation, since spans missed by the proposal module can be retrieved by the QA model in the span linking stage. The recall boost narrows as $k$ becomes large, which is expected since more candidates are proposed at the proposal stage. The span linking stage can thus improve computational efficiency, achieving the same performance with a smaller number of proposed spans.

Effect of Scoring Functions
We study the effect of each part of the scoring functions used in the proposed model. Table 6 shows the results. We have the following observations: (1) token(query)-token(answer): we simplify the model by only signifying the root token in queries (child) and extracting only the root token from the context (parent). The model thus degenerates into a model similar to Biaffine, working at the token-token level. We observe significant performance decreases of 0.57 in UAS and 0.34 in LAS.
(2) token(query)-span(answer): signifying only the token in queries (child) while extracting spans in answers (parent) leads to a decrease of 0.13 and 0.08 respectively for UAS and LAS. (3) span(query)-token(answer): signifying spans in queries (child) but extracting only the token in answers (parent) leads to a decrease of 0.07 and 0.05 respectively for UAS and LAS. Observations (1), (2) and (3) demonstrate the necessity of modeling span-span rather than token-token relations in dependency parsing: replacing the span-based strategy with the token-based strategy for either parent or child progressively degrades performance.
(4) Removing the Mutual Dependency module, i.e., using only the child → parent relation and ignoring the parent → child relation, also leads to a performance decrease.

Dependency Length. Figure 2(b) shows the results with respect to dependency length. The proposed parser shows its advantages on long-range dependencies. We suppose span-level information is beneficial for long-range dependencies.

Subtree Span Length. We further conduct experiments on subtree span length. We divide the average lengths of the two spans in the span linking module into seven buckets. We expect our parser to show advantages on long subtree spans, and the results in Figure 2(c) verify this conjecture. In summary, the span-span strategy works significantly better than the token-token strategy, especially for long sequences. The explanation is as follows: the token-token strategy can be viewed as a coarse simplification of the span-span strategy, where the root token in the token-token strategy can be viewed as the average of all spans covering it, while in the span-span strategy it represents the exact span rather than the average. The deviation of this average from the exact span is relatively small when sequences are short, but becomes larger as sequence length grows, since the number of spans covering the token grows rapidly with length. This makes the token-token strategy work significantly worse for long sequences.

Conclusion
In this paper, we propose to construct dependency trees by directly modeling span-span instead of token-token relations. We use the machine reading comprehension framework to formalize the span linking module, where one span is used as a query to extract the text span/subtree it should be linked to. Extensive experiments on the PTB, CTB and UD benchmarks show the effectiveness of the proposed method.