N-ary Constituent Tree Parsing with Recursive Semi-Markov Model

In this paper, we study graph-based constituent parsing in the setting where binarization is not conducted as a pre-processing step, so that a constituent tree may contain nodes with more than two children. Previous graph-based methods in this setting typically generate hidden nodes with a dummy label inside the n-ary nodes, in order to transform the tree into a binary tree for prediction. The limitation is that the hidden nodes break the sibling relations among the n-ary node's children. Consequently, the dependencies between such sibling constituents cannot be accurately modeled and are simply ignored. To address this limitation, we propose a novel graph-based framework, called the "recursive semi-Markov model". The main idea is to utilize a 1-order semi-Markov model to predict the immediate children sequence of a constituent candidate, which then recursively serves as a child candidate of its parent. In this manner, the dependencies between sibling constituents can be described by 1-order transition features, which resolves the above limitation. In experiments, the proposed framework achieves F1 scores of 95.92% and 92.50% on PTB and CTB 5.1, respectively. In particular, the recursive semi-Markov model shows advantages in modeling nodes with more than two children, whose average F1 is improved by 0.3-1.1 points on PTB and 2.3-6.8 points on CTB 5.1.


Introduction
There are two settings for constituent parsing models: binary tree parsing and n-ary tree parsing. In the former, the original constituent tree with n-ary nodes is converted into a binary tree by language-specific rules; the model first predicts the binary tree and then converts it back. In the latter, the model directly predicts the n-ary tree without the intermediate step of binarization. In this paper, we focus on the setting of n-ary tree parsing. Compared with binary tree parsing, which has the advantage of utilizing lexical head information, n-ary tree parsing fits the original tree structure more naturally and is more adaptable to languages that do not have head rules for binarization. In addition, for languages with a word segmentation issue, such as Chinese, n-ary tree parsing models can conveniently handle the joint task of word segmentation, part-of-speech (POS) tagging and constituent parsing by simply enlarging the label set with the POS labels, as shown in Fig. 1 (a), which alleviates the error propagation of a pipeline.
* Xin Xin is the corresponding author.
Specifically, we focus on improving graph-based models for n-ary tree parsing, since among the two streams of well-developed parsing methods, graph-based and transition-based, graph-based models obtain better performances in recent work (Kitaev et al., 2019; Zhang et al., 2020; Wei et al., 2020). For n-ary tree parsing, the main idea of previous graph-based models is to generate hidden nodes with a dummy label φ inside the n-ary node, in order to expand the n-ary tree into a binary tree. In this way, n-ary tree parsing can be converted into binary tree parsing with hidden nodes, which are unobservable in the training process.

Figure 2: Comparisons of previous and our models. (i, j) denotes the span from i to j − 1; ρ(i, j) denotes the feature score of span (i, j); ψ(i, j, k) denotes the feature score of the sibling span pair (i, j) and (j, k).

Consider the n-ary node "VP→VV, NP, QP" in Fig. 1 (a) as an example. The hidden nodes can be generated in two manners, as shown in Fig. 1 (b, c). Either of them can be seen as correct in training. For convenience, the potential scores of such hidden nodes are manually set to zero, to ensure that the two manners are equivalent when calculating the likelihood (Kitaev and Klein, 2018).
The limitation of previous methods is that the generated hidden nodes break the sibling relations among the n-ary node's children. Consequently, such sibling dependency features cannot be accurately modeled and are simply ignored. Consider the node "VP→VV, NP, QP" in the above example. If we model the 1-order dependency between sibling node pairs, dependency feature scores should be calculated for both pairs (VV, NP) and (NP, QP). Without loss of generality, suppose the hidden node is generated as shown in Fig. 1 (b); the case in Fig. 1 (c) is similar. As the hidden node φ is forced to be the sibling of "QP", the dependency feature of (NP, QP) cannot be directly calculated. In implementation, only the potential score of each node is modeled, and the dependency potential scores of sibling node pairs are ignored.
To solve this limitation, we propose a novel framework for n-ary tree parsing. Our main idea is to utilize a 1-order semi-Markov model to directly predict the immediate children sequence of an n-ary node, without generating hidden nodes for binarization, as shown in Fig. 2. Different from previous models, which only have potential scores on nodes when evaluating a tree's likelihood, the potential scores of sibling node pairs are also calculated, as 1-order transition features. Thus dependencies between sibling nodes can be naturally modeled, which solves the above limitation. When generating an n-ary tree, the semi-Markov model is recursively conducted on the node spans in a bottom-up manner; hence we call the proposed model the "recursive semi-Markov model".
The main challenge in designing the recursive semi-Markov model is keeping the computational complexity acceptable. In the GPU era, making full use of parallel computation is essential to processing speed. For example, in the classical CYK algorithm (Kasami, 1966) for binary trees, the absolute time complexity is O(n^3), where n is the sentence length. But O(n^2) of it can be computed in parallel, by batchifying the spans with the same length and the divisions within a span. This means the hard time complexity of CYK, the part that cannot be computed in parallel, is O(n). For the proposed recursive semi-Markov model, the time complexity of the straightforward dynamic programming algorithm is O(n^5). By careful design, we propose an algorithm whose complexity is O(n^4), of which O(n^3) can be batchified. This means the extra O(n) complexity compared with CYK can be calculated in parallel. In practice, the proposed framework processes 26 and 11 sentences per second on the PTB and CTB 5.1 test sets, respectively, with a single NVIDIA RTX GPU.
Our main contributions can be summarized as follows.
(1) We propose a novel graph-based framework, the recursive semi-Markov model, for n-ary constituent tree parsing, which can model the dependencies of sibling nodes. (2) We design a dynamic programming algorithm for the proposed framework, whose complexity is O(n^4), of which O(n^3) can be batchified. (3) Experimental verifications demonstrate that the proposed framework outperforms previous methods. The F1 of the proposed framework is 95.92% on PTB and 92.50% on CTB 5.1. In the joint task with segmentation and POS tagging on CTB 5.1, the F1 is 91.84%. In addition, the proposed framework can effectively predict nodes with more than two children, improving the F1 by 0.3-1.1 points on PTB and 2.3-6.8 points on CTB 5.1.
Related Work

Early Models for N-ary Tree Parsing

A representative classical method for n-ary tree parsing is the Earley algorithm (Earley, 1970). It can find the legal trees of a sentence that fit the grammar rules with a complexity of O(Cn^3) by dynamic programming, where n is the sentence length and C depends on the complexity of the grammar rules. This dependence on the size of the grammar substantially increases the computational cost in practice. Therefore, recent studies have paid more attention to utilizing "less grammar" (Hall et al., 2014), implemented with CYK/shift-reduce algorithms (Durrett and Klein, 2015; Liu and Zhang, 2017b; Teng and Zhang, 2018) instead of the Earley algorithm, which has been shown to reduce the complexity and also obtain better performances.
Our proposed framework is in line with these recent studies, as its complexity is independent of the size of the grammar rules.

Graph-Based N-ary Tree Parsing
Graph-based parsing models utilize the CYK algorithm to find the tree with the largest feature score as the prediction. Their main advantages are the large search space and the globally optimal inference. A representative graph-based n-ary tree parsing model is the Berkeley parser (Kitaev and Klein, 2018; Kitaev et al., 2019), which employs hidden nodes to deal with n-ary nodes.
The proposed framework belongs to the graph-based n-ary tree parsing models. Compared with previous work, the novelty lies in utilizing a semi-Markov model to directly model the children sequence of an n-ary node, instead of generating a binary tree with hidden nodes. Consequently, it avoids breaking the sibling relations of the nodes in the sequence, and makes use of these dependencies to improve the parsing performance.

Transition-Based N-ary Tree Parsing
Transition-based models make predictions sequentially, with the advantages of low computational cost and the utilization of high-order features. The models can be divided into post-order (Cross and Huang, 2016; Fernández-González and Gómez-Rodríguez, 2019), pre-order (Dyer et al., 2016), and in-order (Liu and Zhang, 2017a) models, according to the traversal manner of the action sequence. Post-order models require either deciding the number of reduced nodes for n-ary nodes (Fernández-González and Gómez-Rodríguez, 2019) or introducing hidden nodes with a dummy label (Cross and Huang, 2016). Pre-order and in-order models naturally handle n-ary nodes, as the number of reduced nodes is fixed.
Both the proposed framework and some of the above methods directly model the children sequence within an n-ary node. The novelty of the proposed framework is that it models the sequence in a graph-based rather than a transition-based manner. Transition-based models suffer from locally optimal inference, whereas graph-based models guarantee globally optimal inference. In recent studies, graph-based models have been demonstrated to perform better than transition-based models (Kitaev et al., 2019; Zhang et al., 2020; Wei et al., 2020).

Preliminaries
A sentence is denoted by x = {x_i}, with x_i being the i-th word. The sentence length is denoted by n. Let Y be the alphabet of constituent labels. Following previous work (Kitaev and Klein, 2018; Zhang et al., 2020), nodes with unary grammars are collapsed, and their labels are replaced by the joint label of the collapsed nodes. For example, in Fig. 1 (a), "CP→IP" is replaced by "CP+IP", where "CP+IP" is an atomic label. Given x, the task is to build an n-ary tree on top of it and assign a label to each internal node. When conducting the joint parsing task with word segmentation and POS tagging in Chinese, Y is enriched with the POS labels and a "C" label (denoting characters), and x_i denotes the i-th character. For example, in Fig. 1 (a), "NN" is a POS label, and "NP+NN" is treated as an atomic label for the corresponding node in the joint parsing task.
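The unary-collapse convention described above can be sketched in a few lines of Python; the helper name and label strings here are illustrative, not the paper's implementation.

```python
def collapse_unary(label_chain):
    """Collapse a unary chain of labels into one atomic joint label,
    e.g. the chain CP -> IP becomes the atomic label "CP+IP"."""
    return "+".join(label_chain)
```

After collapsing, `"CP+IP"` is treated exactly like any other single label in Y.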
The probabilistic graph of the proposed framework is shown in Fig. 3, which corresponds to the tree in Fig. 1 (a). Full circles refer to the input x. Blank circles refer to the internal nodes, which can be seen as variables in the probabilistic graph. A full line connecting two nodes means that the two nodes are dependent on each other. A dotted line pointing to an internal node refers to the sequence of the node's immediate children. There are two kinds of cliques in the graph: those with a single node, and those with two sibling nodes.
The former correspond to 0-order cliques, and the latter to 1-order cliques. The whole framework is a 1-order semi-Markov model.
Potential scores, assigned to the above two kinds of cliques, are denoted by ρ(i, j, l|x, θ) and ψ(i, j, k, l_1, l_2|x, θ), respectively. θ denotes the model parameters, including the neural network weights and word embeddings. In the following, we omit the symbols x and θ in equations for simplicity of presentation. ρ(i, j, l) defines the emission feature score of a span, describing how likely the span is a constituent. (i, j) denotes a span which starts at i and ends at j − 1, 0 ≤ i < j ≤ n, and l ∈ Y denotes the span's label. ψ(i, j, k, l_1, l_2) defines the transition feature score of two sibling spans, describing how likely the two spans are sibling neighbors within an n-ary node. (i, j, k) denotes the two sibling spans (i, j) and (j, k); l_1 is the label of the left span, and l_2 is the label of the right span.
Let y denote a predicted tree given x. The conditional probability p(y|x) can be defined on the probabilistic graph, under the framework of conditional random fields (CRF) (Lafferty et al., 2001), as

s(y) = Σ_{ρ(i,j,l) ∈ C_1(y)} ρ(i, j, l) + Σ_{ψ(i,j,k,l_1,l_2) ∈ C_2(y)} ψ(i, j, k, l_1, l_2),   (1)
p(y|x) = exp(s(y)) / Σ_{y′ ∈ T(x)} exp(s(y′)).   (2)

C_1(y) denotes the set of emission scores, and C_2(y) denotes the set of transition scores. T(x) denotes all legal n-ary trees that can be built on top of the input sentence x. s(y) is the sum of the clique potential scores over a whole tree, with two examples shown in Fig. 4. Given the parameters θ, the inference process is to find the tree with the largest probability.
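The tree score s(y) above is simply a sum over the two clique sets. A minimal sketch, with ρ and ψ as lookup tables holding hypothetical toy values:

```python
def tree_score(emission_cliques, transition_cliques, rho, psi):
    """s(y): sum of emission scores over C1(y) plus transition scores over C2(y)."""
    s = sum(rho[(i, j, l)] for (i, j, l) in emission_cliques)
    s += sum(psi[(i, j, k, l1, l2)] for (i, j, k, l1, l2) in transition_cliques)
    return s
```

For a toy node with children (0, 2) labeled NP and (2, 4) labeled VP, the score is the two emission scores plus the single sibling transition score.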

Potential Score Calculations
Given an input sentence x, we follow the neural network architecture of the Berkeley parser (Kitaev and Klein, 2018), with some minor revisions, to calculate the two kinds of potential scores, ρ(i, j, l) and ψ(i, j, k, l_1, l_2), as shown in Fig. 5.
In the embedding layer, BERT (Devlin et al., 2019; Wolf et al., 2020) is selected to generate the pre-trained vectors, denoted by e_i, 0 ≤ i < n. For Chinese, e_i refers to the i-th character, and the embedding vector of the last character within a word is chosen to represent the word.
In the encoding layer, the Transformer (Vaswani et al., 2017) is selected for extracting the context features, denoted by h_i, whose odd dimensions →h_i and even dimensions ←h_i encode the forward and backward context, respectively. The representation of a single span (i, j) is v(i, j) = (→h_j − →h_i) ⊕ (←h_i − ←h_j), and the representation of a sibling span pair (i, j) and (j, k) is v(i, j, k) = v(i, j) ⊕ v(j, k), where ⊕ is the concatenation operation. By passing v(i, j) and v(i, j, k) through multi-layer perceptrons (MLP), the emission potential score is finally defined as ρ(i, j, l) = MLP_emission^l(v(i, j)), and the transition potential score is defined as ψ(i, j, k, l_1, l_2) = MLP_transition^{l_1,l_2}(v(i, j, k)).
There are |Y| MLPs in total for ρ, where |Y| is the size of the label set Y. The parameters of the hidden layers are shared among them; only the parameters of the output layers differ, to distinguish the labels. Similarly, there are |Y|^2 MLPs for ψ, whose hidden-layer parameters are also shared.
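The parameter sharing can be illustrated with a minimal pure-Python sketch: one shared hidden layer, one output layer per label. All dimensions and weights here are hypothetical, not the paper's configuration.

```python
import random

random.seed(0)
DIM, HIDDEN, NUM_LABELS = 6, 8, 4  # hypothetical sizes

# One shared hidden layer; one output layer per label l. This mirrors the
# |Y| MLPs of the text: shared hidden parameters, distinct output layers.
W_hidden = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(HIDDEN)]
W_out = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(NUM_LABELS)]

def rho_scores(v_span):
    """Emission scores rho(i, j, l) for every label l, given a span vector v(i, j)."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, v_span))) for row in W_hidden]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W_out]
```

One forward pass through the shared hidden layer yields a score for every label at once, which is why only the output layers multiply with |Y|.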

The Max-Margin Loss
When designing the loss function, we could in theory follow the CRF framework and optimize the log-likelihood of the training data. In practice, however, this would require storing the gradients of all potential scores, which is O(n^4) (n is the sentence length), in GPU memory, which is infeasible on a general GPU device. Therefore, we employ the max-margin loss as the training objective to learn the parameters of the proposed framework, following the Berkeley parser (Kitaev and Klein, 2018). With max-margin, only the gradients of the predicted tree structure and the gold structure need to be stored, which is O(n). Consequently, it saves a large amount of memory in implementation.
Let s(y) in Eq. 1 denote the total potential score of a tree y. Suppose the gold tree is y_g, with potential score s(y_g). The key idea of the max-margin loss is to require the maximum potential score of the other trees, denoted s(y*), to be less than s(y_g) by an acceptable margin. In the probability space, this is equivalent to requiring the probability of the gold tree to be larger than the maximum probability of the other trees by a margin. The formal objective is to minimize the following hinge loss,

L = max(0, max_{y ≠ y_g} [s(y) + Δ(y, y_g)] − s(y_g)),

where Δ(y, y_g) refers to the number of spans in y_g not matched in y.
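The structured hinge loss over candidate trees can be sketched as follows, with toy score lists standing in for the (exponentially many) other trees, whose maximization is in reality done by the decoding algorithm.

```python
def hinge_loss(score_gold, scores_others, deltas):
    """max(0, max_y [s(y) + delta(y, y_g)] - s(y_g)) over trees y != y_g.

    scores_others[k] is s(y_k) and deltas[k] is the number of gold spans
    missing from y_k (the margin term)."""
    augmented = max(s + d for s, d in zip(scores_others, deltas))
    return max(0.0, augmented - score_gold)
```

When the gold tree beats every margin-augmented competitor, the loss is zero and no gradient flows, which is the memory-saving property noted above.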

Explanations of the Proposed Model
The Semi-Markov Property. The semi-Markov property of the proposed model refers to the one mentioned in Sarawagi and Cohen's work (Sarawagi and Cohen, 2004). When finding the immediate children of a constituent span, linear-chain Markov structures are assumed over the sequence of candidate immediate constituents. In the implementation, we treat this as a segmentation problem, where each immediate child span can be seen as a segment, a setting similar to the previous work (Sarawagi and Cohen, 2004). Compared with the traditional "B-I-O" tagging schema for segmentation, which assigns a label to each token, the emission feature ρ in the proposed model is defined on a whole segment of several tokens, which is non-Markovian; the Markov property holds between adjacent segments, through the transition feature ψ. This is the semi-Markov property.

Figure 6: Comparisons between the CYK algorithm and the recursive semi-Markov model.
Connections with CRF. Traditional CRF models define a conditional probability over a probabilistic graph and use maximum likelihood estimation as the optimization objective. The proposed model shares the same conditional probability definition from this explanatory view, but utilizes a margin-based loss in order to save computational memory.
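The segmentation view above can be sketched as a single-level 1-order semi-Markov Viterbi pass: find the best labeled segmentation of one span into immediate children. Here ρ and ψ are plain Python callables with hypothetical toy scores; the paper's actual algorithm applies this maximization recursively over spans, batched on GPU.

```python
def best_segmentation(n, labels, rho, psi):
    """Viterbi over a 1-order semi-Markov chain: the best labeled
    segmentation of positions [0, n) into segments (child candidates)."""
    # best[j][(d, l)]: best score of a segmentation of [0, j) whose last
    # segment is (d, j) with label l; back[j][(d, l)]: the previous state.
    best = [dict() for _ in range(n + 1)]
    back = [dict() for _ in range(n + 1)]
    for j in range(1, n + 1):
        for d in range(j):
            for l in labels:
                emit = rho(d, j, l)
                if d == 0:  # first segment of the sequence
                    best[j][(d, l)] = emit
                    back[j][(d, l)] = None
                else:       # append segment (d, j) after the best predecessor
                    score, prev = max(
                        ((best[d][state] + psi(state[0], d, j, state[1], l), state)
                         for state in best[d]),
                        key=lambda t: t[0])
                    best[j][(d, l)] = score + emit
                    back[j][(d, l)] = prev
    # Recover the best final state, then backtrace the segments.
    (d, l), total = max(best[n].items(), key=lambda kv: kv[1])
    segments, j = [], n
    while True:
        segments.append((d, j, l))
        prev = back[j][(d, l)]
        if prev is None:
            break
        j, (d, l) = d, prev
    return total, segments[::-1]
```

The emission ρ scores a whole segment (non-Markovian within a segment), while ψ links adjacent segments, which is exactly the semi-Markov decomposition described above.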

The Challenge
The core of the optimization is to find the tree with the maximum potential score. The previous CYK algorithm uses dynamic programming to find the maximum score in a bottom-up manner. To calculate the maximum score of a given span, all its divisions are enumerated. As shown in Fig. 6 (left), in the binary tree case, the number of divisions equals L − 1, where L is the span length. Moreover, the span length is enumerated from 1 to n, and for each span length L there are n − L + 1 spans. Therefore the total time complexity of CYK is O(n^3). In our case, a span can have more than two immediate children, so all segmentation sequences should be enumerated, which substantially enlarges the search space. In Fig. 6 (right), for a span of length 4, the number of sequences to be considered increases from 3 to 7. This difference is the key issue to be solved in this section.
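One way to check the counts of 3 and 7 for a span of length 4: CYK chooses one of the L − 1 internal split points, whereas a children sequence corresponds to any non-empty subset of those L − 1 positions (the empty subset would make the span its own single child, which unary collapsing rules out).

```python
def num_binary_divisions(L):
    # CYK: choose one split point among the L - 1 internal positions.
    return L - 1

def num_child_sequences(L):
    # Semi-Markov: any non-empty subset of the L - 1 internal positions
    # can serve as the set of child boundaries.
    return 2 ** (L - 1) - 1
```

So the per-span search space grows from linear to exponential in L, which is why a naive enumeration is hopeless and dynamic programming over the children sequence is needed.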

Straight-Forward Algorithm (O(n^5))
Let (i, j) be a representative span (i < j). We need to find its immediate children sequence with the maximum potential score. Dynamic programming is employed to accumulate the maximum potential score from left to right. Let α(i, j′, d, l) be the accumulated variable in the dynamic programming, which accumulates potential scores from j′ = i + 1 to j′ = j. j′ denotes the current accumulated position. d (i < d < j′) means that the last immediate child for span (i, j′) is the span (d, j′), and l refers to the label of (d, j′). The meaning of α(i, j′, d, l) is the maximum accumulated score over all immediate children sequences of span (i, j′) whose last immediate child is (d, j′) with label l. We also include the case of d = i, where α(i, j′, i, l) refers to the maximum accumulated score of the span (i, j′)'s children together with the span (i, j′) itself taking l as its label.
For the semi-Markov model, the following iterative equations hold in the dynamic programming:

α(i, i+1, i, l) = ρ(i, i+1, l),
α(i, j′, d, l) = α(d, j′, d, l) + max_{i ≤ q < d, l′ ∈ Y} [α(i, d, q, l′) + ψ(q, d, j′, l′, l)],   i < d < j′,
α(i, j′, i, l) = ρ(i, j′, l) + max_{i < d < j′, l′ ∈ Y} α(i, j′, d, l′),   j′ > i + 1.

The first equation is the initial state when j′ = i + 1; the second and third equations are the iterative functions for the cases (i < d < j′, j′ > i + 1) and (d = i, j′ > i + 1), respectively. An example of the dynamic programming is shown in Fig. 7.
In the iterative calculation of the above dynamic programming, we need to enumerate q, d, j′, i and j, each of which has a complexity of O(n). The total time complexity of the straight-forward method is therefore O(n^5) · O(|Y|^2). To reduce the |Y|^2 factor when calculating ψ(q, d, j′, l′, l), we manually group the labels in Y into clusters according to the meanings of the constituent labels. Consequently, the main complexity comes from the O(n^5) part.

The Proposed Algorithm (O(n) · O_p(n^3))
In this section, we introduce how to reduce the above complexity from O(n^5) to O(n) · O_p(n^3). O_p(n^3) means that all of the O(n^3) calculations can be batchified. The hard complexity, which cannot be computed in parallel, is O(n).
The overall procedure for designing the algorithm is shown in Fig. 8. It includes four steps for reducing or batchifying the time complexity. In the first step, the complexity is reduced from O(n^5) to O(n^4) by sharing the α values within a set of spans. As shown in Fig. 8 (a), for the span (0, 5) we need to calculate α(0, j, d, l) by enumerating j from 1 to 5. But the value α(0, 4, d, l) has already been calculated for the span (0, 4). Iteratively, all the values α(0, j, d, l) (0 < j < 5) have been calculated for previous spans starting at 0. This means a set of spans with the same start position can share the α values. If we enumerate the span length in ascending order, then for span (i, j) only the j-th position's value α(i, j, d, l) needs to be calculated, instead of enumerating the position from i + 1 to j, which reduces the time complexity by a factor of O(n). In the second step, the complexity is batchified from O(n^4) to O(n^3) · O_p(n), by computing the spans of the same length in parallel, as shown in Fig. 8 (b). In the third step, the complexity is batchified from O(n^3) · O_p(n) to O(n^2) · O_p(n^2), by computing the different values of d in α(i, j, d, l), i < d < j, in parallel, as shown in Fig. 8 (c). In the fourth step, the complexity is batchified from O(n^2) · O_p(n^2) to O(n) · O_p(n^3), by computing α(i, j, d, l) while enumerating the second-last immediate child, i < q < d, in parallel (to calculate the dynamic programming state at a new position given the last child, we need to enumerate the previous states with different second-last children in order to calculate ψ), as shown in Fig. 8 (d).

Algorithm 1: Algorithm for the recursive semi-Markov model. Input: sentence x (length n), model parameters θ. Output: the constituent tree y* with the maximum potential score s(y*|x; θ).
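The fourth step can be pictured with a toy example: for a fixed last child, the new state is a maximization over the second-last child start q. These candidates are independent, so on GPU they become one batched reduction; the one-liner below is the sequential equivalent, with purely hypothetical score values.

```python
# alpha_prev[q]: toy accumulated scores alpha(i, d, q, l') for q = 0, 1, 2.
# psi_q[q]: toy transition scores psi(q, d, j', l', l) for the same q values.
alpha_prev = [1.0, 3.0, 2.0]
psi_q = [0.5, 0.1, 0.4]

# The reduction over q (here a generator max) is what gets batchified:
# every (span, d, label) cell performs this same independent reduction.
best_over_q = max(a + p for a, p in zip(alpha_prev, psi_q))
```

Only the outer loop over span lengths must stay sequential, which is the remaining hard O(n) factor.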
The details of the proposed algorithm are shown in Alg. 1. The calculation of ρ(i, j, l|x; θ) and ψ(i, j, k, l_1, l_2|x; θ) can easily be computed in parallel, with complexities O_p(n^2) and O_p(n^3), respectively. The complexity of calculating α is O(n) · O_p(n^3). Therefore, the total time complexity of the proposed algorithm is O(n) · O_p(n^3).

Experimental Setup
We evaluate the proposed framework in both English and Chinese, on the PTB (WSJ sections (Marcus et al., 1993)) and CTB 5.1 (Xue et al., 2005) datasets, respectively. For Chinese, we evaluate both the single task of constituent parsing and the joint task with word segmentation and POS tagging. We follow the standard splits of the datasets. Previous work utilizes an external tagger (Toutanova et al., 2003) to generate POS tags as input, which leads to a fixed error propagation; in this paper, POS tags are removed and not used as input features in either training or testing on CTB 5.1, following the previous work of Zhang et al. (2020). Standard precision, recall and F1-measure are employed as evaluation metrics, where the EVALB tool is employed for the single task. The hyperparameters of the implementation are shown in Table 1. Most of them are set following the Berkeley parser (Kitaev and Klein, 2018). For the pre-trained models (Wolf et al., 2020), "bert-large-cased" is utilized for English with a single RTX 3090, and "bert-base-chinese" is utilized for Chinese with a single RTX 1080Ti.
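The span-based metric can be sketched as below. This is a simplified sketch: the real EVALB tool additionally handles punctuation, label equivalences and duplicate spans via multiset counting.

```python
def span_f1(pred_spans, gold_spans):
    """Labeled span precision, recall and F1 over (start, end, label) triples."""
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)                       # exactly matching labeled spans
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```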

Performances
The overall performances of the proposed framework in the single task of constituent parsing on the test sets are shown in Table 2 and Table 3.

Table 4: Joint-task performances on the test set of CTB. The "baseline" row shows our running results using a revision of the Berkeley parser (Kitaev et al., 2019).

The baselines in the first block are mainly based on basic word embeddings, and the baselines in the second block are based on BERT (Wolf et al., 2020). It can be observed that the F1-measures of the proposed framework are 95.92% on PTB and 92.50% on CTB 5.1, outperforming the previous state-of-the-art methods. Our implementation of the proposed framework is based on the Berkeley parser (Kitaev et al., 2019); therefore, many settings, such as the learning schedule and feature normalization, are kept similar for fair comparison. Our method outperforms it by 0.33 points on PTB and 0.5 points on CTB 5.1 (of the raw 0.75-point improvement on CTB 5.1, 0.25 points are due to not utilizing automatically predicted POS tags), which demonstrates the advantage of modeling the sibling dependency features. The overall performances in the joint task on the test set of CTB 5.1 are shown in Table 4. As there are few reported results with BERT embeddings, we implemented a minor revision of the previous Berkeley parser (Kitaev et al., 2019) to make it adaptable to the joint task, which serves as the baseline method in the second block. It can be observed that the proposed recursive semi-Markov model outperforms this competitive baseline by 0.46 points in F1, and consistently outperforms the previous method in all tasks of word segmentation, POS tagging, and parsing.
The main improvement of the proposed framework comes from modeling the sibling dependencies within an n-ary node's children sequence. This is especially advantageous for predicting nodes with more children. We divided all the constituent nodes into bins by their number of children; Figure 9 shows the comparisons. The improvement becomes more obvious as the number of children grows. For nodes with more than 2 immediate children, our framework outperforms the baseline by 0.3 to 1.1 points on PTB and 2.3 to 6.8 points on CTB 5.1.
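The binning underlying Fig. 9 can be sketched as follows; the node representation (a label paired with its list of immediate children) is an illustrative assumption, not the paper's data structure.

```python
from collections import defaultdict

def bin_by_children(nodes):
    """Group constituent nodes by their number of immediate children,
    so that per-bin F1 can be computed separately."""
    bins = defaultdict(list)
    for label, children in nodes:
        bins[len(children)].append(label)
    return bins
```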

Model | Sent./Sec.
Zhu et al. (2013) | 90
Shen et al. (2018) | 111
Gómez-Rodríguez and Vilares (2018) | 780
Zhou and Zhao (2019) | 159
Wei et al. (2020) | 220
Zhang et al. (2020) | 1092
Ours | 26
Table 5: Speed comparisons on the PTB test set.

The average processing speed on the PTB test set is 26 sentences per second with a single RTX 3090, and that on the CTB 5.1 test set is 11 sentences per second with a single RTX 1080Ti (or 20 sentences per second with a single RTX 3090). Table 5 shows the speed comparisons between the proposed model and previous methods on the PTB dataset. Figure 10 shows the detailed processing speed of the proposed model on the CTB 5.1 dataset: Figure 10 (left) shows the processing speeds for different sentence lengths, and Fig. 10 (right) shows the processing times of some especially long sentences. For the longest sentence in CTB 5.1, which contains 240 words, parsing takes around 6 seconds. Figure 11 shows the ratio of processing speeds between the Berkeley parser (Kitaev et al., 2019) and our model; the ratio does not grow linearly, thanks to the full use of parallel computation. The speed is still slower than some previous methods. On the one hand, our proposed algorithm has already reduced the complexity through parallel computation; on the other hand, considering its advantage in modeling nodes with multiple children, which occur frequently in the joint parsing task with segmentation and POS tagging in Chinese, the processing speed is still acceptable in many offline cases.

A Further Comparison on Fine-Grained Noun Phrase Structures
Among the nodes having more than two children, some are noun phrases whose internal hierarchical structures have been annotated in the PTB dataset by previous work (Vadas and Curran, 2007, 2011). We also conducted experiments with the Berkeley parser (Kitaev et al., 2019) on this refined PTB data. In testing, we convert the generated fine-grained trees back to the original trees for comparison. The F1 on the refined PTB test set with the Berkeley parser (Kitaev et al., 2019) is 95.62%, which is also outperformed by the proposed method in Table 2.

Conclusion
In this paper, a recursive semi-Markov model is proposed for n-ary constituent tree parsing, with the advantage of modeling the sibling relations within an n-ary node. Experimental verifications on PTB and CTB 5.1 demonstrate that the proposed framework outperforms previous work in the single parsing task on both datasets and in the joint task on CTB 5.1. For constituent nodes with more than 2 children, the F1 is improved by 0.3-1.1 points on PTB and 2.3-6.8 points on CTB 5.1.