Neural Combinatory Constituency Parsing

We propose two fast neural combinatory models for constituency parsing: binary and multi-branching. Our models decompose the bottom-up parsing process into 1) classification of tags, labels, and binary orientations or chunks and 2) vector composition based on the computed orientations or chunks. These models have theoretical sub-quadratic complexity and empirical linear complexity. The binary model achieves an F1 score of 92.54 on Penn Treebank with a speed of 1,327.2 sents/sec. Both models, combined with XLNet, provide near state-of-the-art accuracies for English. The syntactic branching tendency and headedness of a language are observed during the training and inference processes for Penn Treebank, Chinese Treebank, and Keyaki Treebank (Japanese).


Introduction
Transition-based and chart-based methods are the two main paradigms for constituency parsing. Transition-based parsers (Dyer et al., 2016; Kitaev and Klein, 2020) build a tree with a sequence of local actions. Despite their O(n) computational complexity, the locality makes them less accurate and necessitates additional grammars or lookahead features for improvement (Kuhlmann et al., 2011; Zhu et al., 2013; Liu and Zhang, 2017c). By contrast, chart-based parsers are conceptually simple and accurate when used with a CYK-style algorithm (Kitaev and Klein, 2018; Zhou and Zhao, 2019) for finding the global optimum. However, their complexity is O(n^3). Achieving both accuracy and simplicity, without high complexity, is a critical problem in parsing.
Recent efforts were made using neural models. In contrast to earlier symbolic approaches (Charniak, 2000; Klein and Manning, 2003), neural models are simplified by utilizing their adaptive distributed representations, thereby eliminating complicated symbolic engineering. The seq2seq model for parsing (Vinyals et al., 2015) leverages such representations to interpret the structural task as a general sequential task. With augmented data and an ensemble, it outperforms the symbolic models mentioned in Petrov et al. (2006) and provides a complexity of O(n^2) with the attention mechanism (Bahdanau et al., 2015). However, its performance is inferior to those of specialized neural parsers (Liu and Zhang, 2017a,b,c). Socher et al. (2013) proposed a parsing strategy for a symbolic constituent parser augmented with neural vector compositionality. It did not outperform the two paradigms in neural style, probably because neural techniques such as contextualization were not fully exploited. Kitaev and Klein (2020) showed that a simple transition-based model with a dynamic distributed representation, BERT (Devlin et al., 2019), delivers nearly state-of-the-art performance.

Figure 1: Binary parsing explores the internal constituents of S. Special labels prefixed with "#" or "_" are sub-category placeholders caused by binarization and stratification.
We propose a pair of greedy combinatory parsers (i.e., neural combinators) that efficiently utilize vector compositionality with recurrent components to address the aforementioned issues. Their bottom-up parsing process is a recursive layer-wise loop of classification and vector composition, as illustrated in Figures 1 & 2. Both parsers work on multiple unfolded variable-length layers, iteratively combining vectors until one vector remains. The binary model provides either a left or right orientation for each word or constituent, whereas the multi-branching model marks chunks as constituents at their boundaries. Constituent embeddings are composed based on orientations or chunks. Tagging and labeling are performed directly on all composed embeddings, creating the elements for building a tree: tags, labels, and paths. The deterministic and greedy characteristics yield two simple and fast models, and the two models investigate different linguistic aspects.

Figure 2: Multi-branching parsing uses chunks instead of orientations to form constituents. Chunks impose Softmax-normalized weights on their inputs. The unsupervised weights provide a shred of evidence for the headedness problem (Zwicky, 1985).

The contributions of our study are as follows:
• We propose two combinatory parsers at O(n) average-case complexity with a theoretical O(n^2) upper bound. The binary parser achieves a competitive F1 score on Penn Treebank. Both models are the fastest and yet more compact than many previous models.
• We extend the proposed models with a recent pre-trained language model, XLNet (Yang et al., 2019). These models have higher speeds and are comparable to state-of-the-art parsers.
• The binary model leverages Chomsky normal form (CNF) factors as a training strategy and reflects the branching tendency of a language. The multi-branching model reveals constituent headedness (Zwicky, 1985) with an attention mechanism.

Neural Combinatory Parsing

Data and Complexity
Our models require stratified trees to train recurrent layers, and the binary model requires further binarization. Stratification and binarization introduce redundant relaying nodes to the trees.
Tree binarization. From the bottom-up perspective, a binary tree describes the order in which words and constituents combine with their neighbors into larger constituents, as shown in Figure 3. The orientations of the four words (i.e., right-left-right-left) determine the first combination. After binarization, we label the relaying sub-constituents with the parent label prefixed with an underscore mark. If terminal POS tags do not immediately form constituents, we create relaying placeholders prefixed with a hash mark, as presented in Table 1. (Multi-branching trees do not require binarization; the '_Sub' group disappears, but the '#POS' group persists.) Unary branches were collapsed into a single node; plus marks were used to join their labels (e.g., SBAR+S), and all trace branches were removed. The CNF with either a left or a right factor is commonly used. However, it is heuristically biased, and trees can be binarized with other balanced splits, such as always splitting from the center to create a complete binary tree (mid-out) or alternating left and right splits to create another balanced tree (mid-in). Finally, the orientation is extracted from the paths of these binary trees.
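To make the factoring concrete, the following sketch (ours, not the authors' released code) applies a CNF-left or CNF-right factor to a simple (label, children) tuple representation, inserting relay nodes with the underscore prefix described above.

```python
# A minimal sketch of CNF binarization on (label, children) tuples; the
# tree representation and function names are our assumptions.
def binarize(label, children, factor="left"):
    """Recursively binarize an n-ary node; relay nodes get a "_" prefix."""
    kids = [binarize(l, c, factor) if c else (l, c) for l, c in children]
    while len(kids) > 2:
        if factor == "left":                        # CNF-left: fold leftmost pair
            kids = [("_" + label.lstrip("_"), kids[:2])] + kids[2:]
        else:                                       # CNF-right: fold rightmost pair
            kids = kids[:-2] + [("_" + label.lstrip("_"), kids[-2:])]
    return (label, kids)

# e.g. a flat NP with three children under the left factor:
binarize("NP", [("DT", []), ("JJ", []), ("NN", [])])
# -> ('NP', [('_NP', [('DT', []), ('JJ', [])]), ('NN', [])])
```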
We binarized the Penn Treebank (Marcus et al., 1993, PTB) for English, the Chinese Treebank (Xue et al., 2005, CTB) for Chinese, and the Keyaki Treebank (Butler et al., 2012, KTB) for Japanese to present the syntactic branching tendencies in Table 2. As English is a right-branching language, its majority orientation is to the right. Even left-factoring cannot reverse the trend, but it should create a greater balance. Figure 4 shows that it is less effective to stratify PTB with a right factor because it reinforces the tendency. The reverse tendency emerges in the KTB corpus, as Japanese is a left-branching language. For Chinese, CTB does not show a noticeable tendency.

Complexity. Our models are trained with stratified treebanks. The complexity of inference follows the total number of nodes in each layer of a tree. There are two ideal cases: 1) complete balanced trees with complexity O(n), which contain multiple independent phrases and enable full concurrency; and 2) trees with a single dependency core, where the model reduces a constant number of nodes in each layer, resulting in O(n^2) complexity.
While each parse is a mixture of many cases, the empirical complexity favors the first case. Formally, the average-case complexity can be inferred as O(n) with the help of a stable compression ratio 0 < C < 1 (C ≥ 0.5 for binary). Let m_i represent the number of children of the i-th tree in a general layer; the compression ratio can be stated as C = (∑_i 1) / (∑_i m_i). Our stratified treebanks give stable Cs for layers of different lengths, as shown in Figure 5. For the k-th layer of a sentence with n words, the number of nodes to compute can be expected to be C^k ⋅ n. Based on the tree height K > 0, the expected number of total parsing nodes is ∑_{k=0}^{K} C^k ⋅ n < n / (1 − C). This partial geometric series determines an empirically linear complexity on average. Theoretically, the complexity has a quadratic upper bound: a general layer with C = (n − 1) / n entails the second case, where only one binary combination happens in each layer. The nodes then shape a triangular stratified tree with O(n^2) complexity. However, this case is rare, especially for long sentences.

Algorithm 1: Combinatory Parsing. Function PARSE(e_{0:n}; t_{0:n}, l^{0:k}_{0:n_k}, o^{0:k}_{0:n_k} or c^{0:k}_{0:n_k+1}).

Data structure. To summarize the data components of a treebank corpus, we used four tensors of indices for 1) words, 2) POS tags, 3) stratified syntactic labels, and 4) stratified orientations or stratified chunks (depending on the model), where n is the sentence length, k indicates the k-th layer of the stratified data, and n_k is the layer length. ":" indicates a range of a sequence.
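Returning to the complexity analysis, the partial geometric series can be checked numerically with a short sketch (our illustration, idealizing the stable ratios of Figure 5 as a single constant C):

```python
# A back-of-the-envelope check of the expected node count, assuming a
# constant compression ratio C per layer.
def expected_nodes(n, C):
    total, width = 0.0, float(n)
    while width > 1.0:
        total += width        # nodes computed in this layer
        width *= C            # the next layer shrinks by the compression ratio
    return total + 1          # plus the final root node

print(expected_nodes(200, 0.5))   # best binary case: ~399 nodes, clearly O(n)
print(expected_nodes(200, 0.9))   # still bounded by n / (1 - C) = 2000 nodes
```

Any fixed C < 1 keeps the total below n / (1 − C), i.e., linear in n; only a length-dependent ratio such as C = (n − 1) / n produces the quadratic worst case.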

Combinatory Parsing
Our models comprise four feedforward (FFNN) and two bidirectional LSTM (BiLSTM) networks that decompose parsing into collaborative functions, as shown in Algorithm 1. During training, we use teacher forcing. In the inference phase, the supervised signals behind all semicolons are ignored; the predicted signals serve as their substitutes. Input e_{0:n} is an embedding sequence indexed by x_{0:n}. In lines 2-5, the model prepares a contextual sequence for the combinator and predicts the lexical tags. Lines 6-10 describe the layer-wise loop of the combinator.
The tagging and labeling functions, FFNN tag and FFNN label, are 2-layer FFNNs. Their first layer is shared, creating a hidden layer necessary for projecting diversified situations in the manifold onto the non-zero logits for the argmax decision. The core function COMPOSE is either the binary Algorithm 2 or the multi-branching Algorithm 3.
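The shared first layer can be sketched as follows (a PyTorch illustration under our assumptions; the tag and label vocabulary sizes are placeholders, while the 300-unit model size and 200-unit labeling hidden size follow the settings reported in the experiments):

```python
import torch
import torch.nn as nn

# A sketch of FFNN_tag / FFNN_label with a shared first layer.
class TagLabelHead(nn.Module):
    def __init__(self, model_dim=300, hidden_dim=200, n_tags=45, n_labels=120):
        super().__init__()
        self.shared = nn.Linear(model_dim, hidden_dim)    # shared hidden layer
        self.tag_out = nn.Linear(hidden_dim, n_tags)      # FFNN_tag, 2nd layer
        self.label_out = nn.Linear(hidden_dim, n_labels)  # FFNN_label, 2nd layer

    def forward(self, x):                 # x: [..., model_dim]
        h = torch.relu(self.shared(x))
        return self.tag_out(h), self.label_out(h)   # logits for argmax decisions
```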
Algorithm 2: Binary Compose

Binary model. In Algorithm 2, the orientation function is hinted by BiLSTM ori. A single-layer FFNN ori with a threshold reduces its outputs to an integer of either 0 or 1, indicating the two possible orientations. In function BINARY, when two adjacent orientations agree (i.e., they sum to 2), their embeddings are combined by a combinatory operation. σ is the Sigmoid function, "⊕" represents concatenation, and "⊙" represents pointwise multiplication.
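One possible reading of the BINARY step is sketched below (ours, not the authors' code): counting each node's orientation toward the shared boundary, an agreeing pair is a right-leaning node followed by a left-leaning neighbor, and the pair is reduced by a sigmoid-gated interpolation standing in for the ⊕/⊙ combinatory operation.

```python
import torch
import torch.nn as nn

d = 300                      # model size for vector compositionality
gate = nn.Linear(2 * d, 1)   # stands in for the interpolation FFNN

def binary_compose(x, ori):
    """x: [n, d] layer embeddings; ori[i] = 1 if node i leans right, else 0."""
    out, i = [], 0
    while i < len(x):
        # both nodes lean toward their shared boundary (the "sum to 2" case)
        if i + 1 < len(x) and ori[i] == 1 and ori[i + 1] == 0:
            lam = torch.sigmoid(gate(torch.cat([x[i], x[i + 1]])))
            out.append(lam * x[i] + (1 - lam) * x[i + 1])  # gated interpolation
            i += 2                        # the pair becomes one parent node
        else:
            out.append(x[i])              # relay the node to the next layer
            i += 1
    return torch.stack(out)
```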
(See Appendix A.3 for more binary variants.)

Multi-branching model. To resemble binary interpolation, we use the Softmax function for each chunk, as described in Algorithm 3. BiLSTM chk takes the place of BiLSTM ori to hint FFNN chk, which emits chunk signals. Segment s splits x_{0:n} and d_{0:n} into chunks x chk and d chk. FFNN multi and Softmax turn d chk into attention weights λ chk to interpolate the vector chunk x chk. The binary interpolation λ is a special case of the multi-branching λ chk because the Sigmoid and Softmax functions are closely related.
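A sketch of the chunk-wise interpolation, under our assumptions (for brevity we score the composed vectors directly, whereas Algorithm 3 feeds the BiLSTM hints d chk to FFNN multi):

```python
import torch
import torch.nn as nn

d = 300
scorer = nn.Linear(d, 1)   # stands in for FFNN_multi over the chunk hints

def chunk_compose(x, bounds):
    """x: [n, d]; bounds: chunk boundaries, e.g. [0, 3, 5, n]."""
    parents = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        chunk = x[lo:hi]                            # the members of one chunk
        lam = torch.softmax(scorer(chunk), dim=0)   # [m, 1] attention weights
        parents.append((lam * chunk).sum(dim=0))    # weighted composition
    return torch.stack(parents)                     # one vector per constituent
```

With a chunk of size two, the Softmax over two logits reduces to a Sigmoid of their difference, which is the sense in which the binary λ is a special case of λ chk.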
To obtain the final tree representation, we apply a symbolic pruner in the same bottom-up manner to remove redundant nodes, expand the collapsed nodes, and assemble the sub-trees based on the neural outputs. (See Appendix A.4.)
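The pruning can be pictured with the same tuple representation as before (ours, not the released code): labels prefixed with "_" or "#" are spliced out, and "+"-collapsed unary chains are re-expanded.

```python
# A sketch of the symbolic pruner on (label, children) tuples: relay
# placeholders are lifted away, and collapsed unaries (e.g. "SBAR+S")
# are unfolded back into nested nodes.
def prune(label, children):
    kids = []
    for l, c in children:
        kids.extend(prune(l, c) if c else [(l, c)])
    if label.startswith(("_", "#")):      # relay node: lift children up
        return kids
    parts = label.split("+")              # e.g. ["SBAR", "S"]
    node = (parts[-1], kids)
    for l in reversed(parts[:-1]):        # rebuild the unary chain top-down
        node = (l, [node])
    return [node]
```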

Experiments
We follow previous data splits for PTB, CTB, and KTB (See Appendix A.2). The preprocessing of data is described in Section 3.1.
For the binary model, we explored interpolated dynamic datasets created by sampling from two CNF-factored datasets, for the following reasons: 1) the experiments with non-CNF factors did not yield any promising results, so we do not report them; 2) a language may be loosely left-branching, loosely right-branching, or show no noticeable tendency, and the use of a single static dataset may introduce a severe orientation bias; and 3) all factors are intermediate variables and equally correct. We defined the sampling strategies over two static CNF-factored datasets at certain ratios and named each strategy in the format "L%R%" according to the ratio percentages; a sketch of this sampling appears below. Our experiments mainly focus on the binary model B because the aforementioned properties allow training parsers that are more accurate than the multi-branching model M.
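A minimal sketch of the L%R% sampling (our illustration, reusing the binarize sketch from Section 3.1): each epoch draws a factor per sentence at the named ratio.

```python
import random

# A sketch of the "L%R%" dynamic dataset: per sentence and per epoch, draw
# the left- or right-factored binarization at the named ratio, e.g. L85R15.
def dynamic_epoch(treebank, p_left=0.85, seed=None):
    rng = random.Random(seed)
    for label, children in treebank:
        factor = "left" if rng.random() < p_left else "right"
        yield binarize(label, children, factor)   # binarize: Section 3.1 sketch
```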
Our parsers do not contain lexical information components (Liu and Zhang, 2017c; Kitaev and Klein, 2018). Instead, we use fastText (Bojanowski et al., 2017) because pre-trained models are easily obtainable for many languages, and new ones can be trained from scratch with the corpora at hand. We examine its influence in Section 4.2; the official pre-trained embeddings are the default.
Meanwhile, pre-trained language models are useful for various tasks, including constituency parsing (Kitaev and Klein, 2018, 2020; Zhou and Zhao, 2019; Yang and Deng, 2020; Mrini et al., 2020). We chose XLNet (Yang et al., 2019) to compare with the static fastText embeddings. Specifically, either a 1-layer FFNN (/0) or an n-layer BiLSTM (/n+) was used to convert the 768-unit output to our model size. We used a GeForce GTX 1080 Ti with 11 GB of memory, and a TITAN RTX with 24 GB only for tuning XLNet.
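The two conversion heads can be sketched as follows (PyTorch, with our naming; "/0" and "/n+" follow the paper's notation):

```python
import torch.nn as nn

# Sketches of the heads converting XLNet's 768-unit output to the 300-unit
# model size: a 1-layer FFNN ("/0") or an n-layer BiLSTM ("/n+").
ffnn_head = nn.Linear(768, 300)                          # the "/0" variant
bilstm_head = nn.LSTM(input_size=768, hidden_size=150,   # a "/2+" variant;
                      num_layers=2, bidirectional=True,  # 2 x 150 = 300 outputs
                      batch_first=True)
```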
The model size for vector compositionality was set to 300. The hidden sizes for labeling, orientation, and chunking were 200, 64, and 200, respectively. Different numbers of layers for BiLSTM cxt (/n) were explored, and the default was six layers. HINGE-LOSS was the default criterion for orientation, while binary cross-entropy (BCE-LOSS) was also tested. The coefficients of the three losses were explored via grid search.

Table 3: …(2018) and our models belong to type O and have similar complexities. Generally, the accuracy follows the complexity, whereas the speed roughly follows the year of publication rather than complexity or type.

Comparison of Models
Models with fastText. We investigated the binary model through ablation. The impacts of fastText are presented in the upper part of Table 4. B/E does not require any external data beyond PTB, making it comparable to models without pre-trained GloVe embeddings (Pennington et al., 2014). We then replaced BiLSTM ori with an FFNN to examine its effect; the results are in the bottom rows. The comparison tests whether the embeddings collaborate on the orientation signals, because an FFNN regards each input independently.
Finally, we used a grid search to explore the hyperparameter space of our three loss coefficients. Figure 6 shows that the performance correlates with the orientation loss the most, but it is not overly sensitive to the hyperparameters.

Figure 7: Probabilistic interpolation of two CNF factors against F1 scores. The capacity of BiLSTM cxt is almost saturated with 6 or 8 layers.
Pre-trained language model. We compared the results using frozen fastText with those using frozen XLNet in Table 5. The accuracy of the model increased along with the depth of BiLSTM cxt, and it exhibited the most significant increase across all variants. Owing to XLNet, our complexity grew to O(n^2). We also fine-tuned our models and compared them with other parsers using fine-tuned language models; these are listed in Table 6.

Tree-Binarization Strategy
To reflect the branching tendency, our best single model for PTB was obtained on the dynamic L95R05 dataset. This dataset is a probabilistic interpolation between a left-factored dataset (with a 95% chance) and a right-factored dataset (with a 5% chance), as shown in Figure 7. The best model for CTB appeared on the left side at L70R30, scoring 86.14, whereas the best for KTB was on the L30R70 dataset, scoring 87.05 with a 6-layer BiLSTM cxt. Typically, the results for all corpora had a minimum at L50R50. For English, the left "wing" was higher than the right; the opposite trend was observed for Japanese. For Chinese, no clear trend was obtained.
All the experiments described in the previous sections were conducted on the PTB L85R15 dataset.

Complexity and Speed
To test our linear speed advantage, we inflated our training data with redundant nodes to resemble the triangular chart of the CYK algorithm, as depicted in Figure 8 and Table 7. A parse in the triangular treebank has the worst-case complexity of O(n^2). Meanwhile, training with linearity halved the training time, reduced memory usage, and removed the length limit for our three corpora. The difference between linear and quadratic complexity is stark.

Table 6: Improvements with pre-trained language models. We used a greedy search algorithm on a single GeForce GTX 1080 Ti. Rows 6-8 are reported by Yang and Deng (2020) using a GeForce RTX 2080 Ti. Kitaev and Klein (2020) used a cloud TPU with a beam search algorithm and a larger batch size.

Model Structure
Our parsers comprise a neural encoder for scoring (i.e., Algorithm 1) and a non-neural decoder for searching. The decoder is a symbolic extension of the encoder in that both run in a bottom-up manner, and the decoder interprets the scores as local-and-greedy decisions. Other neural parsers also fit a similar encoder-decoder framework. However, decoders with dynamic programming often include forward and backward processes heterogeneous to their encoders.

Speed and size. One of our research goals was to achieve simplicity and efficiency. In terms of speed, our models parallelize more actions than transition-based parsers and have fewer computing nodes than chart parsers. In terms of size, our models contain approximately 4M parameters in addition to the 13M fastText (or 114M XLNet) pre-trained embeddings, which is fewer than in many previous models.

Vector compositionality. The performance of FFNN ⋆ ori is inferior to that of its RNN counterparts, suggesting that some information might not be encoded locally. Thus, the COMPOSE function should remain in a contextual form to collaboratively leverage the whole layer. However, BiRNN might still be a bottleneck for long-range orientation, as suggested in Figure 9. BiLSTM chk is a major weakness of the multi-branching model, especially for longer sentences.

Tree Binarization and Headedness
Tree binarization. Probabilistic interpolation of two CNF-factored datasets is effective for the three languages studied, as shown in Figure 7. Dynamic sampling allows the model to cover a wider range of composed vectors, improving its robustness to ambiguous orientations. It may seem counterintuitive for human learners that the best model is obtained with left-biased interpolation for a right-branching language, and vice versa. However, for a neural model, balancing the frequency seems to be the key factor for improving performance (Sennrich et al., 2016; Zhao et al., 2018). The fact that the L50R50 dataset yielded the worst models also suggests that the balance should be based on a default orientation tendency. This could also be the reason why mid-in and mid-out did not improve the model. Figure 10 shows the intermediate parses of the same sentence from our two models. They are typical examples of the output.

Figure 10: Intermediate parses from both models of "Predicting the financial results of computer firms has been a tough job lately ." (POS: NNS IN NN NNS VBZ VBN DT JJ NN RB .)

The binary model postpones the combination with adjuncts such as punctuation and adverbs (red spans). The high frequencies of determiners in noun phrases make them great attractors. On the other hand, the multi-branching model pays close attention to what the syntactic head is supposed to be. In the noun phrases, determiners receive the highest weight averages (red), and the nouns obtain the second highest (blue). This phenomenon suggests that an English noun phrase's syntactic role is mainly projected from its determiner, as discussed by Zwicky (1985). Table 8 provides more statistical support. For example, the model selects DT as an NP head if one is available; otherwise, nouns and adjectives are prominent heads. The Chinese and Japanese parsers behave similarly with respect to headedness. (See Appendix A.5.)

Figure 11: A failed parse of "`` Margin debt was at a record high ." from the multi-branching model. The model stops parsing and saves computation when it repeats the same chunking positions.

Error Analysis
The rate of invalid parses is the last aspect that we consider for our parsers. For the binary parser, fatal errors, such as frame-breaking orientations, appear at an early stage of training. However, the last 90% of training time contains very few errors, and our binary model is free from invalid parses on the test set. For the multi-branching parser trained with fastText, 11 out of 2,416 test parses are forests rather than single parse trees. However, the multi-branching parser with fine-tuned XLNet reduces the error count on the test set to 1.
We present a failed multi-branching parse with fastText in Figure 11. The postnominal adjective "high" is uncommon in English. Because the model did not group it with the adjacent "a record" to form an NP, the error propagated to higher layers (e.g., no PP was built as an adjunct to form a VP), causing the bad parse. This implies that the multi-branching model requires an appropriate predicate-argument configuration to chunk.

Conclusion
We proposed a pair of neural combinatory constituency parsers. The binary one yields F1 scores comparable to those of recent neural parsers, and the multi-branching one reveals constituent headedness. Both are simple and efficient, with relatively high speeds. We also leveraged a pre-trained language model and CNF factors to increase accuracy, and we reflected the branching tendencies of three languages.

Figure 13: Adding sub nodes makes flat structures more efficient. Using this strategy as a new dynamic dataset also brings the multi-branching model M a stable accuracy improvement, with an F1 score of 92.36 on PTB. However, it has nothing to do with linguistic properties. We save it for a future study.

A Appendices
A.1 Compression Ratios and Linearity

Figure 12 presents examples of tree binarization and the worst case of O(n^2) complexity. Figure 14 shows the overall linear data complexities in the three languages. Figures 15 & 16 and Table 9 indicate that, given a language and a factor, the compression ratio is stable and seldom affected by sentence length.
The regressions for PTB and CTB show weak O(n^2) tendencies; the quadratic coefficients can be either positive or negative. Meanwhile, KTB falls into the worst case, as shown in Figure 14. This is because KTB trees tend to have a flat structure on the right side of their parses, as illustrated in Figure 17. Relaying nodes in the flat structure never combine until the final layer, creating strong O(n^2) tendencies. As a result, all KTB datasets fall into the worst case, especially when binarized with the CNF-left factor.
A preprocess that groups the flat structure into the sub category can prevent considerable quadratic impacts on all datasets. All O(n^2) tendencies are largely weakened across the three corpora, and all linear coefficients drop significantly, as illustrated on the right of Figure 14. The preprocess cannot eradicate the worst case in KTB. However, the magnitudes of all linear coefficients are at least hundreds of times larger than those of the quadratic terms. In our sub-quadratic case, 200 words lead to approximately 1.5K nodes. Meanwhile, a sentence with n words has a triangular chart with n(n+1)/2 nodes, whose quadratic coefficient is 0.5; in this case, 200 words lead to approximately 20K nodes.
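The triangular figure follows directly from the chart size:

```latex
\frac{n(n+1)}{2}\,\bigg|_{n=200} = \frac{200 \cdot 201}{2} = 20{,}100 \approx 20\text{K},
```

more than an order of magnitude above the roughly 1.5K nodes of the sub-quadratic case.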

A.2 Experiment Setting
The treebanks PTB and CTB have been widely used for experiments. For PTB, sections 2-21 were used for training, section 22 for development, and section 23 for testing. For CTB, articles 001-270 and 440-1151 were used for training, 301-325 for development, and 271-300 for testing. There is no widely accepted data split for the KTB corpus, except for some probabilistic divisions, because KTB contains mixed data from sources such as newswires, book digests, and Wikipedia. We randomly reserved 2,075 samples for development, 1,863 samples for testing, and the remaining roughly 33K as training samples. Few sentences in the training sets were longer than 100 words (3 of 40K in PTB; 96 of 17K in CTB; 55 of 33K in KTB). Frozen English (wiki.en.bin), Chinese (cc.zh.300.bin), and Japanese (cc.ja.300.bin) embeddings were used for PTB, CTB, and KTB, respectively. We fed fastText the PTB text to train cbow embeddings (instead of skipgram) for B/E with the default settings for 50 epochs.
The batch size was 80, and sentences longer than 100 words were excluded from the triangular data to avoid out-of-memory (OOM) errors on a single GeForce GTX 1080 Ti with 11 GB. We froze XLNet to train our model and then tuned XLNet from the 5th epoch. We doubled the batch size to 160 in the inference phase. We used the Adam optimizer with a default learning rate of 10^-3, while we opted for XLNet's Adam hyperparameters when tuning the pre-trained XLNet (e.g., a learning rate of 10^-5). We adopted a warm-up period of one epoch and a linear decrease after the 15th evaluation since the last best evaluation. The recurrent dropout rate was 0.2; other dropout probabilities for FFNNs were set to 0.4. For model selection, the training process terminated when the development score did not improve on the highest score for 100 consecutive evaluations. The Evalb program was used for F1 scoring.
We present score profiles for our main models in Table 10. The discrepancy in F1 scores and the difference between precision and recall are relatively small on the PTB development and test sets.

A.3 Variants of Binary Compose
If we choose the relay instruction in line 12 of Algorithm 2, additive vector compositionality is retained (Mikolov et al., 2013), as in the ADD variant in lines 5-6 of Algorithm 4. The model can still infer a full tensor tree; however, ADD causes the vector magnitude to increase cumulatively with the tree height, which is unwanted in a recurrent or recursive neural network. Therefore, we examined a learnable FFNN multi with Sigmoid activation to perform gate-style interpolation in five variants, NS, NV, CS, CV, and BV, as described in lines 8-17. When a variant takes no input and produces a scalar interpolation parameter λ, we call this case NS. ("∅" is a placeholder for no input.) Meanwhile, CV indicates concatenated input and vectorized interpolation. BV is a variant that involves a biaffine tensor operation.

Figure 15: Binarized corpora with four factors. Curved tiers can be observed in each plot. For example, the leftmost tier is composed of (n−1)/n (followed by (n−2)/n, (n−3)/n, and so on). The dots in this tier range from a high compression ratio of 0.5 to the least efficient ones in their corpus. Efficient dots are more populated, judging by their sizes and colors. All statistics yield stable means, which are also presented in Table 9.

In terms of the F1 score, the most competitive variants against CV are BV and NV, suggesting that fine-grained interpolation can effectively facilitate vector compositionality. The similar results of CS, NS, and ADD validate this suggestion: vector compositionality is not as trivial as an additive function at the scalar level, and a matrix operation is sufficient. BV is the costliest variant, with a tensor operation that runs very slowly (30 sents/sec).
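Two of the variants can be sketched as follows (our PyTorch reading of lines 8-17; the shapes and wiring are assumptions):

```python
import torch
import torch.nn as nn

class NS(nn.Module):
    """No input, scalar lambda: a single learned interpolation weight."""
    def __init__(self):
        super().__init__()
        self.logit = nn.Parameter(torch.zeros(1))
    def forward(self, left, right):
        lam = torch.sigmoid(self.logit)
        return lam * left + (1 - lam) * right

class CV(nn.Module):
    """Concatenated input, vectorized lambda: per-dimension gating."""
    def __init__(self, d=300):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)
    def forward(self, left, right):
        lam = torch.sigmoid(self.gate(torch.cat([left, right], dim=-1)))
        return lam * left + (1 - lam) * right
```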

A.4 Recovering Symbolic Tree
To obtain the final tree representation, we initialized the workspace with the leaves of words and their POS tags.