Fast-R2D2: A Pretrained Recursive Neural Network based on Pruned CKY for Grammar Induction and Text Representation

Chart-based models have shown great potential in unsupervised grammar induction, running recursively and hierarchically, but they require O(n³) time. The Recursive Transformer based on Differentiable Trees (R2D2) makes it possible to scale to large language model pretraining even with a complex tree encoder, by introducing a heuristic pruning method. However, its rule-based pruning process suffers from local optima and slow inference. In this paper, we propose a unified R2D2 method that overcomes these issues. We use a top-down unsupervised parser as a model-guided pruning method, which also enables parallel encoding during inference. Our parser casts parsing as a split point scoring task by first scoring all split points for a given sentence and then using the highest-scoring one to recursively split a span into two parts. The reverse order of the splits is taken as the order of pruning in the encoder. We optimize the unsupervised parser by minimizing the Kullback–Leibler divergence between tree probabilities from the parser and the R2D2 model. Our experiments show that Fast-R2D2 significantly improves grammar induction quality and achieves competitive results in downstream tasks.


Introduction
Compositional, hierarchical, and recursive processing are widely believed to be essential traits of human language across diverse linguistic theories (Chomsky, 1956, 2014). Chart-based models (Maillard et al., 2017; Kim et al., 2019a; Drozdov et al., 2019; Hu et al., 2021) have made promising progress in both grammar induction and hierarchical encoding in recent years. The differentiable CKY encoding architecture of Maillard et al. (2017) simulates the hierarchical and recursive process explicitly by introducing an energy function to combine all possible derivations when constructing each cell representation. However, this entails a cubic time complexity, which makes it impossible to scale to large language model training like BERT (Devlin et al., 2018). Its cubic memory cost also limits the tree encoder's ability to draw on models with huge numbers of parameters as a backbone. Hu et al. (2021) introduced a heuristic pruning method, successfully reducing the time complexity to a linear number of compositions. Their experiments show that chart-based models exhibit great potential for grammar induction and representation learning when applying a sophisticated tree encoder such as Transformers with large corpus pretraining, leading to a Recursive Transformer based on Differentiable Trees, or R2D2 for short. However, R2D2's heuristic pruning approach is rule-based and only considers certain composition probabilities. Thus, trees constructed in this way are not guaranteed to be globally optimal. Moreover, as each step during pruning is based on previous decisions, the entire encoding process is sequential and thus slow in the inference stage.

* Work done while at Ant Group. To contact Haitao: haitaomi@global.tencent.com
1 The code is available at: https://github.com/alipay/StructuredLM_RTDT
In this work, we resolve these issues by proposing a unified method with a new global pruning strategy based on a lightweight and fast top-down parser. We cast parsing as split point scoring, where we first encode the input sentence with a bi-directional LSTM and score all split points in parallel. Specifically, for a given sentence, the parser first scores each split point between words in parallel by looking at its left and right contexts, and then recursively splits a span (starting with the whole sentence) into two sub-spans by picking the highest-scoring split point among the current split candidates. Subsequently, the reverse order of the sorted split points can serve as the merge order to guide the pruning of the CKY encoder, which enables the encoder to search for more reasonable trees. As the gradient of the pretrained component cannot be back-propagated to the parser, inspired by URNNG (Kim et al., 2019b), we optimize the parser by taking trees sampled from the CKY chart table generated by the encoder as ground truth. Thus, the parser and the chart-based encoder improve each other back and forth, much like the policy and value networks in AlphaZero (Silver et al., 2017). Additionally, the pretrained tree encoder can compose sequences recursively in parallel according to the trees generated by the parser, which makes Fast-R2D2 a Recursive Neural Network (Pollack, 1990; Socher et al., 2013) variant.
In this paper, we make the following main contributions:
1. We propose an architecture to jointly pretrain the parser and encoder of a recursive network with linear memory cost. Experiments show that our pretrained parser outperforms models custom-tailored for grammar induction.
2. By encoding in parallel following trees generated by the top-down parser, Fast-R2D2 improves the inference speed 30- to 50-fold compared to R2D2.
3. We pretrain Fast-R2D2 on a large corpus and evaluate it on downstream tasks. The experiments demonstrate that a pretrained recursive model based on an unsupervised parser significantly outperforms pretrained sequential Transformers (Vaswani et al., 2017) with the same parameter size in single-sentence classification tasks.

Preliminaries
2.1 R2D2 Architecture

Differentiable Trees. R2D2 follows the work of Maillard et al. (2017) in defining a CKY-style (Cocke, 1969; Kasami, 1966; Younger, 1967) encoder. For a sentence $S = \{s_1, s_2, ..., s_n\}$ with $n$ words or word-pieces, it defines a chart table as illustrated in Figure 1. In the table, each cell $T_{i,j}$ is a tuple $\langle e_{i,j}, p_{i,j}, \tilde{p}_{i,j} \rangle$, where $e_{i,j}$ is a vector representation, $p_{i,j}$ is the probability of a single composition step, and $\tilde{p}_{i,j}$ is the probability of the subtree for the span $[i, j]$ over the sub-string $s_{i:j}$. When $i$ equals $j$, the table has terminal nodes $T_{i,i}$ with $e_{i,i}$ initialized with the embeddings of input tokens $s_i$, while $p_{i,i}$ and $\tilde{p}_{i,i}$ are set to one. When $j > i$, the representation $e_{i,j}$ is a weighted sum of intermediate combinations $c^k_{i,j}$. Here, $k$ is a split point from $i$ to $j - 1$, $f(\cdot)$ is an n-layer Transformer encoder, $p^k_{i,j}$ and $\tilde{p}^k_{i,j}$ denote the single-step combination probability and the subtree probability, respectively, at split point $k$, $\mathbf{p}_{i,j}$ and $\tilde{\mathbf{p}}_{i,j}$ are the concatenation of all $p^k_{i,j}$ or $\tilde{p}^k_{i,j}$ values, and GUMBEL is the Straight-Through Gumbel-Softmax operation of Jang et al. (2017) with temperature set to one. As GUMBEL picks the optimal split point $k$ at each cell in practice, it is straightforward to recover the complete derivation tree recursively from the root node $T_{1,n}$ in a top-down manner.

Heuristic pruning proceeds by repeating the following steps (see Figure 2):
1. Recover the maximum sub-tree for each cell at the m-th level, and collect all cells at the 2nd level that appear in any sub-tree.
2. Rank candidates in Step 1 by the composition probability $p_{i,j}$, and pick the highest-scoring cell as a non-splittable span (e.g., $T_{1,2}$).
3. Remove any invalid cells that would break the now non-splittable span from Step 2, e.g., the dark cells in (c), and reorganize the chart table much like in the Tetris game as in (d).
4. Encode the blank cells at the m-th level, e.g., the cell highlighted with stripes in (d), and go back to Step 1 until the root cell has been encoded.
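The cell update summarized in the Differentiable Trees paragraph can be written out explicitly. The following is a hedged reconstruction from the definitions given there, in the form used by Hu et al. (2021). The product rule for $\tilde{p}^k_{i,j}$, the logarithm inside GUMBEL, and the symbol $\alpha_{i,j}$ for the resulting selection weights are assumptions rather than text from this section.

```latex
\begin{aligned}
c^k_{i,j},\; p^k_{i,j} &= f(e_{i,k},\, e_{k+1,j}) \\
\tilde{p}^k_{i,j} &= p^k_{i,j}\, \tilde{p}_{i,k}\, \tilde{p}_{k+1,j} \\
\alpha_{i,j} &= \mathrm{GUMBEL}\bigl(\log \tilde{\mathbf{p}}_{i,j}\bigr) \\
e_{i,j} &= \bigl[c^{i}_{i,j},\, c^{i+1}_{i,j},\, \ldots,\, c^{j-1}_{i,j}\bigr]\, \alpha_{i,j} \\
\bigl[p_{i,j},\, \tilde{p}_{i,j}\bigr] &= \alpha_{i,j}^{\top} \bigl[\mathbf{p}_{i,j},\, \tilde{\mathbf{p}}_{i,j}\bigr]
\end{aligned}
```

Under this reading, terminal cells contribute probability one, so the subtree probability of a span is the product of the single-step probabilities along its selected derivation.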
Pretraining. To learn meaningful structures without gold trees, Hu et al. (2021) propose a self-supervised pretraining objective. Similar to the bidirectional masked language model task, R2D2 reconstructs a given token $s_i$ based on its context representations $e_{1,i-1}$ and $e_{i+1,n}$. The probability of each token is estimated by the tree encoder defined in R2D2. The final objective is to minimize:

$$\mathcal{L}_{bilm} = -\sum_{i=1}^{n} \log p(s_i \mid e_{1,i-1}, e_{i+1,n}) \quad (6)$$

3 Methodology

Global Pruning Strategy
We propose a top-down parser based on syntactic distance (Shen et al., 2018) to evaluate scores for all split points in a sentence and generate a merge order according to the scores.
Top-down parser. Given a sentence $S = \{s_1, s_2, ..., s_n\}$, there are $n - 1$ split points between words. We define a top-down parser that assigns a confidence score $v_i$ to every split point. To keep it simple and rigorously maintain linear complexity, we select bidirectional LSTMs as the backbone, though Transformers are also an option. As shown in Figure 3, a bi-directional LSTM first encodes the sentence, and then, for each split point, an MLP over the concatenation of the left and right context representations yields the final split score. Formally, we have:

$$\overrightarrow{h}, \overleftarrow{h} = \mathrm{BiLSTM}(E), \qquad v_i = \mathrm{MLP}(\overrightarrow{h}_i \oplus \overleftarrow{h}_{i+1})$$

Here, $E$ is the embedding of the input sentence $S$, $\overrightarrow{h}$ and $\overleftarrow{h}$ denote the forward and reverse representations, respectively, and $\oplus$ denotes concatenation. $v_i$ is the score of the $i$-th split point, whose left and right context representations are $\overrightarrow{h}_i$ and $\overleftarrow{h}_{i+1}$. Given scores $[v_1, v_2, ..., v_{n-1}]$, one can easily recover the binary tree shown in Figure 3: we recursively split a span (starting with the entire sentence) into two sub-spans by picking the split point with the highest score in the current span. Taking the sentence in Figure 3 (a) as an example, we split the overall sentence at split point 3 in the first step, which leads to two sub-trees over $s_{1:3}$ and $s_{4:6}$. Then we split $s_{1:3}$ at 2 and $s_{4:6}$ at 4. We continue this procedure until the complete tree has been recovered.
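The recursive splitting procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `build_tree`, the nested-tuple tree encoding, and the example scores are all assumptions, chosen so that the splits match the worked example (3 first, then 2 and 4).

```python
from typing import List, Optional, Tuple, Union

# A tree is either a 1-based token index (leaf) or a pair of sub-trees.
Tree = Union[int, Tuple["Tree", "Tree"]]


def build_tree(scores: List[float], left: int = 1,
               right: Optional[int] = None) -> Tree:
    """Recursively split the span [left, right] at its best split point.

    scores[k - 1] holds v_k, the score of the split between tokens
    k and k + 1, so a sentence of n tokens has n - 1 scores.
    """
    if right is None:
        right = len(scores) + 1
    if left == right:
        return left  # a single token cannot be split further
    # Candidate split points inside the span are left .. right - 1.
    best = max(range(left, right), key=lambda k: scores[k - 1])
    return (build_tree(scores, left, best),
            build_tree(scores, best + 1, right))


# Hypothetical scores for a 6-token sentence (not taken from the paper).
example_scores = [0.1, 0.5, 0.9, 0.4, 0.2]
example_tree = build_tree(example_scores)
```

With these hypothetical scores, the sentence is split at 3, then $s_{1:3}$ at 2 and $s_{4:6}$ at 4, giving the bracketing (((1 2) 3) (4 (5 6))).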
Tree sampling. In the training stage, we perform sampling over the computed scores $[v_1, v_2, ..., v_{n-1}]$ in order to increase the robustness and exploration of our model. Let $P_t$ denote the list of split points at time $t$ in ascending order, which is $\{1, 2, 3, ..., n-1\}$ in the first step. Then a particular split point $a_t$ is selected from $P_t$ by sampling according to the probabilities estimated from the scores of the remaining split points. The sampled $\{a_1, a_2, ..., a_{n-1}\}$ together form the final split point sequence $A$. At each time step, we remove $a_t$ from $P_t$ once it is selected, then sample the next split point until the set of remaining split points is empty. Formally, we have:

$$p(a_t \mid P_t) = \mathrm{softmax}(\mathbf{v}^t)_{a_t}$$

where $\mathbf{v}^t$ is the concatenation of the $v_i$ in $P_t$. As the Gumbel-Max trick (Gumbel, 1954; Maddison et al., 2014) provides a simple and efficient way to draw samples from a categorical distribution with class probabilities, we can obtain $a_t$ as:

$$a_t = \arg\max_{i \in P_t} \, [v_i + g_i]$$

where $g_i$ is the Gumbel noise for the $i$-th split point. Therefore, the aforementioned process is equivalent to sorting the original sequence of split point scores with added Gumbel noise. Figure 3 (b) shows a sampled tree with respect to the split point scores. The split point sequence $A$ can hence be obtained simply as:

$$A = \mathrm{argsort}(\mathbf{v} + \mathbf{g})$$

Here, argsort sorts the array in descending order and returns the indices of the original array. The sampled $A$ is {2, 4, 3, 5, 1} in Figure 3 (b).
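The equivalence between sequential sampling and a single noisy sort can be illustrated with the standard library alone. The sketch below is an assumption-laden illustration, not the authors' code: `sample_split_order` is a hypothetical name, split points are 1-based as in the text, and the uniform draw is clamped away from 0 and 1 purely to keep the logarithms finite.

```python
import math
import random
from typing import List, Optional


def sample_split_order(scores: List[float],
                       rng: Optional[random.Random] = None) -> List[int]:
    """Sample a split point sequence A by sorting scores plus Gumbel noise.

    scores[k - 1] is v_k; the result lists 1-based split points in the
    order in which they would be chosen (descending noisy score).
    """
    rng = rng or random.Random()
    noisy = []
    for k, v in enumerate(scores, start=1):
        # Standard Gumbel(0, 1) noise: g = -log(-log(u)) with u in (0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))
        noisy.append((v + g, k))
    # argsort in descending order of the perturbed scores
    noisy.sort(key=lambda pair: pair[0], reverse=True)
    return [k for _, k in noisy]


# The merge order M used for pruning is the reverse of the sampled A.
A = sample_split_order([0.1, 0.5, 0.9, 0.4, 0.2], random.Random(0))
M = list(reversed(A))
```

Dropping the noise term (or taking the plain descending sort of the scores) recovers the deterministic order used at inference time.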
Span Constraints. As word-pieces (Wu et al., 2016) and Byte-Pair Encoding (BPE) are commonly used in pretrained language models, it is straightforward to incorporate multiple word-piece constraints into the top-down parser to reduce word-level parsing errors. We denote a list of span constraints composed of beginning and end positions of non-splittable spans as $C$, defined as $C = \{(b_1, e_1), (b_2, e_2), ..., (b_n, e_n)\}$. For each $(b_i, e_i)$ in $C$, there should be a sub-tree for a span covering the sub-string $s_{b_i:e_i}$. This goal can be achieved by simply adjusting the scores of all splits within the spans in $C$ by $-\delta$. To make them smaller than the scores of span boundaries, $\delta$ could be defined as $\max(\mathbf{v}) - \min(\mathbf{v}) + c$, where $c$ could be any positive number.
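The score adjustment for span constraints can be sketched as follows. This is a minimal illustration under stated assumptions: the name `apply_span_constraints` is hypothetical, positions are 1-based and inclusive, and a split point k (between tokens k and k + 1) is taken to lie inside a span (b, e) when b ≤ k < e.

```python
from typing import List, Sequence, Tuple


def apply_span_constraints(scores: Sequence[float],
                           constraints: Sequence[Tuple[int, int]],
                           c: float = 1.0) -> List[float]:
    """Lower the scores of split points that fall inside constrained spans.

    Every penalized score ends up below min(scores), so the greedy
    top-down split reaches all span boundaries before breaking a span.
    """
    if not scores:
        return []
    delta = max(scores) - min(scores) + c
    adjusted = list(scores)
    for b, e in constraints:
        for k in range(b, e):  # split points strictly inside [b, e]
            # Subtract from the original score so overlapping
            # constraints do not penalize the same split twice.
            adjusted[k - 1] = scores[k - 1] - delta
    return adjusted


# Hypothetical example: tokens 3 and 4 are word-pieces of one word.
scores = [0.1, 0.5, 0.9, 0.4, 0.2]
adjusted = apply_span_constraints(scores, [(3, 4)])
```

In this example the formerly best split point 3 is pushed below every other score, so the first split now happens at 2 and tokens 3 and 4 stay together until the end.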
Model-based Pruning. We denote the reverse order of the split point sequence $A$ as $M$ and treat $M$ as a bottom-up merge order inferred by the top-down parser based on the global context. The complete pruning process is then as follows:
1. Pick the next merge index by invoking Algorithm 1.
2. Perform Steps 3 and 4 of the heuristic pruning procedure in Section 2.1.
As shown in Figure 2, we still retain the threshold and the pruning logic of R2D2, but we select cells to merge according to $M$ instead of following heuristic rules. Specifically, given a shrinking chart table, we select the next merge index among the second row by popping and modifying $M$ in Algorithm 1.
Algorithm 1 Next merge index in the second row
    i = pop(M)    ▷ Next merge index
    for j ∈ 1 to M.len do
        if M_j > i then    ▷ Merging at left
            M_j = M_j − 1    ▷ Remaining indices to the right shift by one
    return i

Take the example in Figure 3 (b): $M$ starts with {1, 5, 3, 4, 2}. We first merge the first cell in the second row in Figure 2 (b) and obtain a new $M$ = {4, 2, 3, 1}. In the next round, we treat the 4th cell covering $s_{5:6}$ as a non-splittable cell in Figure 2 (e), and $M$ becomes {2, 3, 1}.
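The index bookkeeping in the worked example can be reproduced with a short Python sketch. It is inferred from the surviving pseudocode fragments and the example values, so the function name `next_merge_index` and the choice to pop from the front of a list are assumptions.

```python
from typing import List


def next_merge_index(merge_order: List[int]) -> int:
    """Pop the next merge index and re-index the remaining entries.

    Merging the cell at index i removes one column from the shrinking
    chart table, so every remaining index greater than i moves left by one.
    The list is modified in place.
    """
    i = merge_order.pop(0)
    for j, m in enumerate(merge_order):
        if m > i:  # the merge happened to the left of this index
            merge_order[j] = m - 1
    return i


M = [1, 5, 3, 4, 2]           # reverse of the sampled A = {2, 4, 3, 5, 1}
first = next_merge_index(M)   # merges the first cell; M is now [4, 2, 3, 1]
second = next_merge_index(M)  # merges the 4th cell; M is now [2, 3, 1]
```

Running the two calls reproduces the sequence of $M$ values given in the example above.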

Optimization
We denote the tree probabilities estimated by the top-down parser and R2D2 as $p_\theta(z|S)$ and $q_\phi(z|S)$, respectively. The difficulty here is that while $q_\phi(z|S)$ may be optimized by the objective defined in Equation 6, there is no gradient feedback for $p_\theta(z|S)$. To make $p_\theta(z|S)$ learnable, an intuitive solution is to fit $p_\theta(z|S)$ to $q_\phi(z|S)$ by minimizing their Kullback-Leibler divergence. As both distributions are discrete and their trees cannot be enumerated exhaustively, inspired by URNNG (Kim et al., 2019b), a Monte Carlo estimate for the gradient with respect to $\theta$ can be defined as:

$$\nabla_\theta \, \mathrm{KL}[q_\phi(z|S) \,\|\, p_\theta(z|S)] \approx -\frac{1}{K} \sum_{k=1}^{K} \nabla_\theta \log p_\theta(z^{(k)}|S)$$

with samples $z^{(1)}, ..., z^{(K)}$ from $q_\phi(z|S)$. Algorithm 2 shows the complete sampling process from $q_\phi(z|S)$. Specifically, we sample split points with corresponding span boundaries recursively as in previous work (Goodman, 1998; Finkel et al., 2006; Kim et al., 2019b) with respect to the intermediate tree probabilities calculated during hierarchical encoding.
Algorithm 2 Top-down tree sampling for R2D2: for each span, iterate over the candidate list L (for k ∈ 1 to len(L)), sample a split point, push (L[idx], i, j) onto K to keep the split point and span boundary, and finally return K.

A sequence of split points and corresponding spans is returned by the sampler. For the $k$-th sample $z^{(k)}$, let $p_\theta(a^k_t|S)$ denote the probability of taking $a^k_t$ as the split from span $(i^k_t, j^k_t)$ at the $t$-th step. Formally, we have:

$$p_\theta(a^k_t \mid S) = \frac{\exp(v_{a^k_t})}{\sum_{i=i^k_t}^{j^k_t - 1} \exp(v_i)}, \qquad \log p_\theta(z^{(k)} \mid S) = \sum_{t=1}^{n-1} \log p_\theta(a^k_t \mid S)$$

where $i^k_t$ and $j^k_t$ denote the start and end of the corresponding span. Please note that the $v_i$ here are not adjusted by span constraints.
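To make the parser objective concrete, the sketch below evaluates the log-probability that the parser assigns to a sampled tree and the resulting Monte Carlo loss, whose gradient with respect to the scores is the estimate described above. It assumes a softmax over the split points inside each span; `tree_log_prob` and `kl_surrogate_loss` are hypothetical names. In practice the scores would be tensors in an autodiff framework; plain floats are used here only to show the arithmetic.

```python
import math
from typing import Sequence, Tuple

# (split point a_t, span start i_t, span end j_t), all 1-based
Split = Tuple[int, int, int]


def tree_log_prob(scores: Sequence[float], splits: Sequence[Split]) -> float:
    """Sum of log softmax probabilities of each chosen split in its span."""
    total = 0.0
    for a, i, j in splits:
        # Split points available inside span (i, j) are i .. j - 1.
        candidates = [scores[k - 1] for k in range(i, j)]
        top = max(candidates)
        log_z = top + math.log(sum(math.exp(v - top) for v in candidates))
        total += scores[a - 1] - log_z
    return total


def kl_surrogate_loss(scores: Sequence[float],
                      samples: Sequence[Sequence[Split]]) -> float:
    """Negative mean log-likelihood of trees sampled from the encoder."""
    return -sum(tree_log_prob(scores, z) for z in samples) / len(samples)


# Hypothetical 3-token sentence with uniform scores: splitting at 1 first
# has probability 1/2, and the remaining span (2, 3) has a single choice.
loss = kl_surrogate_loss([0.0, 0.0], [[(1, 1, 3), (2, 2, 3)]])
```

Minimizing this loss increases the scores of split points that the encoder's sampled trees prefer, which is the sense in which the parser is fitted to the R2D2 model.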

Downstream Tasks
Inference. In this paper, we mainly focus on classification tasks as downstream tasks. We consider the root representation as representing the entire sentence. As we have two models pretrained in our framework, an R2D2 encoder and a top-down parser, we have two ways of generating the representations:
a) Run forced encoding over the binary tree from the top-down parser with the R2D2 encoder.
b) Use the binary tree to guide the pruning of the R2D2 encoder, and take the root representation $e_{1,n}$.
The first approach is much faster than the second, as the R2D2 encoder only runs $n - 1$ times in forced encoding and can run in parallel layer by layer, e.g., we may run compositions at $a_5$, $a_3$, and $a_4$ in parallel in Figure 3 (b). We explore both of these approaches in our experiments.
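The layer-by-layer parallelism of forced encoding can be illustrated by grouping the compositions of a tree by height. The sketch below is an illustration under assumptions, not the released implementation: `parallel_layers` is a hypothetical name, and a composition is identified by the 1-based split point of the node it creates.

```python
from collections import defaultdict
from typing import Dict, List


def parallel_layers(split_order: List[int]) -> List[List[int]]:
    """Group the n - 1 compositions of a tree into parallel layers.

    split_order is the split point sequence A (earlier entries are split
    first). A node's height is one more than that of its taller child,
    and all nodes of equal height can be composed at the same time.
    """
    n = len(split_order) + 1
    rank = {k: t for t, k in enumerate(split_order)}
    layers: Dict[int, List[int]] = defaultdict(list)

    def height(left: int, right: int) -> int:
        if left == right:
            return 0  # leaves need no composition
        # The split of a span is its earliest split point in A.
        k = min(range(left, right), key=lambda s: rank[s])
        h = 1 + max(height(left, k), height(k + 1, right))
        layers[h].append(k)
        return h

    height(1, n)
    return [sorted(layers[h]) for h in sorted(layers)]


# Sampled order from Figure 3 (b): A = {2, 4, 3, 5, 1}
layers = parallel_layers([2, 4, 3, 5, 1])
```

For this order the first layer contains split points 1, 3, and 5, which are exactly $a_5$, $a_3$, and $a_4$; the five compositions therefore need only three sequential encoder steps.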
Training Objectives. As suggested in prior work (Radford et al., 2018; Howard and Ruder, 2018; Gururangan et al., 2020), given a pretrained model, continued pretraining on an in-domain corpus with the same pretraining objective can yield better generalization. Thus, we simply combine our pretraining objectives via summation in all downstream tasks. At the same time, as the downstream task may guide R2D2 to more reasonable tree structures, we still maintain the KL loss to enable the parser to update continuously. For the two inference methods, we uniformly select the root representation $e_{1,n}$ as the representation for a given sentence, followed by an MLP, and estimate the cross-entropy loss, denoted as $\mathcal{L}_{forced}$ and $\mathcal{L}_{cky}$, respectively. Let $\mathcal{L}_{KL}$ denote the KL loss described in Section 3.2 and $\mathcal{L}_{bilm}$ denote the bidirectional language model loss described in Eq. 6. The final loss is:

$$\mathcal{L} = \mathcal{L}_{forced} + \mathcal{L}_{cky} + \mathcal{L}_{KL} + \mathcal{L}_{bilm}$$

The compared systems include the model of Shen et al. (2019b), URNNG (Kim et al., 2019b), DIORA (Drozdov et al., 2019), C-PCFG (Kim et al., 2019a), and R2D2 (Hu et al., 2021). To observe the marginal gain from pretraining, we also include Fast-R2D2 without pretraining, denoted as Fast-R2D2 w/o. Following Htut et al. (2018), we train all systems on a training set consisting only of raw text, and evaluate and report the results on an annotated test set. As an evaluation metric, we adopt sentence-level unlabeled F1 computed using the script from Kim et al. (2019a). We compare against the non-binarized gold trees per convention. The results of Fast-R2D2 are obtained from 3 runs of each model with different random seeds in pretraining. The best checkpoint for each system is picked based on scores on the validation set. Fast-R2D2 is pretrained with span constraints for the word level but without span constraints for the word-piece level. To support word-piece level evaluation, we convert gold trees to word-piece level trees by simply breaking each terminal node into a non-terminal node with its word-pieces as terminals, e.g., (NN discrepancy) into (NN (WP disc) (WP ##re) (WP ##pan) (WP ##cy)).
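The conversion of a gold terminal into a word-piece level sub-tree can be sketched as a small string transformation. The helper name `expand_terminal` is hypothetical, the word-pieces are assumed to come from an external tokenizer, and leaving single-piece words unchanged is an assumption that the text does not spell out.

```python
from typing import List


def expand_terminal(tag: str, pieces: List[str]) -> str:
    """Turn a (TAG word) terminal into a (TAG (WP p1) (WP p2) ...) node."""
    if len(pieces) == 1:
        # Assumption: a word that is a single word-piece stays a terminal.
        return f"({tag} {pieces[0]})"
    inner = " ".join(f"(WP {piece})" for piece in pieces)
    return f"({tag} {inner})"


# Reproduces the example from the text for the word "discrepancy".
node = expand_terminal("NN", ["disc", "##re", "##pan", "##cy"])
```

Applied to every terminal of a gold tree, this yields trees whose leaves align with the word-piece input of the model.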
Environment.EFLOPS (Dong et al., 2020) is a highly scalable distributed training system designed by Alibaba.With its optimized hardware architecture and co-designed supporting software tools, including ACCL (Dong et al., 2021) and KSpeed (the high-speed data-loading service), it could easily be extended to 10K nodes (GPUs) with linear scalability.
Hyperparameters. The tree encoder of our model uses 4-layer Transformers with 768-dimensional embeddings, 3,072-dimensional hidden layer representations, and 12 attention heads. The top-down parser of our model uses a 4-layer bidirectional LSTM with 128-dimensional embeddings and 256-dimensional hidden layers. The sampling number $K$ is set to 256. Training is conducted using Adam optimization with weight decay, using a learning rate of $5 \times 10^{-5}$ for the tree encoder and $1 \times 10^{-2}$ for the top-down parser. The batch size is set to 64 per GPU for m=4, though we also limit the maximum total length for each batch, such that excess sentences are moved to the next batch. The limit is set to 1,536. It takes about 120 hours for 60 epochs of training with m=4 on 8 A100 GPUs.
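The batching rule described above (at most 64 sentences per batch and a cap of 1,536 on the total length, with excess sentences deferred to the next batch) can be sketched as a greedy loop. The function name `make_batches` is hypothetical, and placing a single over-long sentence in a batch of its own is an assumption, since the text does not say how such inputs are handled.

```python
from typing import List


def make_batches(lengths: List[int], max_sentences: int = 64,
                 max_total_length: int = 1536) -> List[List[int]]:
    """Greedily group sentence indices under a count cap and a length cap."""
    batches: List[List[int]] = []
    current: List[int] = []
    total = 0
    for idx, length in enumerate(lengths):
        full = len(current) >= max_sentences
        too_long = total + length > max_total_length
        if current and (full or too_long):
            # Close the batch; this sentence moves to the next one.
            batches.append(current)
            current, total = [], 0
        current.append(idx)
        total += length
    if current:
        batches.append(current)
    return batches


# Hypothetical sentence lengths: the second sentence would exceed the cap.
batches = make_batches([1000, 600, 500])
```

Because cell computations scale with sentence length, capping the total length rather than only the sentence count keeps per-batch memory roughly bounded.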
Data. For English, to fully leverage the scalability of Fast-R2D2, we pretrain Fast-R2D2 on WikiText103 (Merity et al., 2017) and then fine-tune the model on the Penn Treebank (PTB) (Marcus et al., 1993) et al., 2005). We test our approach on PTB WSJ data with the standard splits (2-21 for training, 22 for validation, 23 for test) and the same preprocessing as in recent work (Kim et al., 2019a), where we discard punctuation and lower-case all tokens.
To explore the universality of the model across languages, we further evaluate using the CTB, on which we also remove punctuation. Note that in all settings, training and fine-tuning are conducted entirely on raw unannotated text.

Results and Discussion
Table 1 shows the results of all systems with words (W) and word-pieces (WP) as input on the WSJ and CTB test sets. When we evaluate all systems on word-level gold trees, our Fast-R2D2 performs substantially better than R2D2 across both datasets. We denote as Fast-R2D2 the method of using the parser to guide the pruning and selecting the best tree using the chart table, and as Fast-R2D2* the system that uses the top-down parser for tree induction with subsequent R2D2 encoding. Interestingly, the results suggest that Fast-R2D2  Kim et al. (2019a).
Following Kim et al. (2019b) and Drozdov et al. (2020), we also compute the recall of constituents when evaluating on word-piece level gold trees. Besides standard constituents, we also compare the recall of word-piece chunks and proper noun chunks. Proper noun chunks are extracted by finding adjacent unary nodes with the same parent and tag NNP. Table 2 reports the recall scores for constituents and words on the WSJ and CTB test sets. Compared with the R2D2 baseline, our Fast-R2D2 performs slightly worse for small semantic units, but significantly better over larger semantic units (such as VP and SBAR) on the WSJ test set. On the CTB test set, our Fast-R2D2 outperforms R2D2 on all constituents.
From Tables 1 and 2, we conclude that Fast-R2D2 overall obtains better results than R2D2 on CTB, while faring slightly worse than R2D2 only for small semantic units on WSJ.We conjecture that this difference stems from differences in tokenization between Chinese and English.Chinese is a character-based language without complex morphology, where collocations of characters are consistent with the language, making it easier for the top-down parser to learn them well.In contrast, word-pieces for English are built based on statistics, and individual word-pieces are not necessarily natural semantic units.Thus, there may not be sufficient semantic self-consistency, such that it is harder for a top-down parser with a small number of parameters to fit it well.

Downstream Tasks
We next consider the effectiveness of Fast-R2D2 in downstream tasks.This experiment is not intended to advance the state-of-the-art on the GLUE benchmark but rather to assess to what extent our approach performs respectably against the dominant inductive bias as in conventional sequential Transformers.

Setup
Data and Baseline. We fine-tune pretrained models on several datasets, including SST-2, CoLA, QQP, and MNLI from the GLUE benchmark (Wang et al., 2018). As sequential Transformers with their dominant inductive bias remain the norm for numerous NLP tasks, we mainly compare Fast-R2D2 with BERT (Devlin et al., 2018) as a representative pretrained model based on a sequential Transformer. We did not include recursive models such as Gumbel-Tree-LSTMs (Choi et al., 2018) and CRvNN (Chowdhury and Caragea, 2021) among our baselines, as they are not pretrained models. In order to compare the two forms of inductive bias fairly and efficiently, we pretrain BERT models with 4 layers and 12 layers as well as our Fast-R2D2 from scratch on the WikiText103 corpus following Section 4.1.1. Considering that longer inputs in the pretraining stage are helpful for BERT's downstream task performance, we use the original corpus that is not split into sentences as inputs. For simplicity, Fast-R2D2 is fine-tuned without span constraints. Following common settings, we add an MLP layer over the root representation of the R2D2 encoder for single-sentence classification. For cross-sentence tasks such as QQP and MNLI, we feed the root representations of the two sentences into the pretrained tree encoder of R2D2 as left and right inputs, and also add a new task ID as another input term to the R2D2 encoder. Then we feed the hidden output of the new task ID into another MLP layer to predict the final label. We train all systems across the four datasets for 10 epochs with a learning rate of $5 \times 10^{-5}$, batch size 64, and maximum input length 200. We validate each model in each epoch and report the best results on development sets.

Results and Discussion
Table 3 shows the corresponding scores on SST-2, CoLA, QQP, and MNLI. In terms of parameter size, our Fast-R2D2 model has 52M and 10M parameters for the R2D2 encoder and top-down parser, respectively. The 12-layer BERT is clearly significantly better than the 4-layer BERT. As mentioned in Section 3.3, Fast-R2D2 has two options to construct the final tree and representation for a given input sentence: Fast-R2D2* uses the output tree from the top-down parser, while Fast-R2D2 uses the best tree inferred by the R2D2 encoder. Similar to the results for unsupervised parsing, Fast-R2D2* in classification tasks again outperforms Fast-R2D2. We hypothesize that trees generated by the top-down parser without Gumbel noise are more stable and reasonable. Fast-R2D2 significantly outperforms 4-layer BERT and achieves competitive results compared to 12-layer BERT in single-sentence classification tasks such as SST-2 and CoLA, but still performs significantly worse in the cross-sentence tasks. We believe this is an expected result, as there is no cross-attention mechanism in the inductive bias of Fast-R2D2. However, the performance of Fast-R2D2 on classification tasks shows that the inductive bias of R2D2

Speed Evaluation
To assess the time cost, we mainly compare sequential Transformers and Fast-R2D2 in forced encoding on various sequence length ranges. We randomly select 1,000 sentences for each range from WikiText103 and report the average time consumption on a single A100 GPU. BERT is based on the open source Transformers library and R2D2 is based on the official code in Hu et al. (2021). Table 4 shows the inference time in seconds for different systems to process 1,000 sentences with a batch size of 50. Running R2D2 is time-consuming, since the heuristic pruning method involves substantial memory exchanges between GPU and CPU. In Fast-R2D2, we alleviate this problem by using model-guided pruning to accelerate the chart table processing. In conjunction with a CUDA implementation, this reduces the inference time significantly. Fast-R2D2* further improves the inference speed by running forced encoding in parallel over the binary tree generated by the parser, which is about 30-50 times faster than R2D2 across the various ranges. Although there is still a gap in speed compared to sequential Transformers, Fast-R2D2* is sufficiently fast for most NLP tasks while producing interpretable intermediate representations.

Related Work
Many attempts have been made at grammar induction and hierarchical encoding. Clark (2001) and Klein and Manning (2002) provided some of the first successful statistical approaches to grammar induction. Multiple recent papers focus on structure induction based on language modeling objectives (Shen et al., 2019a,b, 2021; Kim et al., 2019a). Pollack (1990) proposed to use RvNNs as a recursive architecture to encode text hierarchically, and Socher et al. (2013) showed the effectiveness of RvNNs with gold trees for sentiment analysis. In this work, we focus on models that are capable of learning meaningful structures in an unsupervised way and encoding text over the induced tree hierarchically.
In the line of work on learning a sentence representation with structures, Yogatama et al. (2017) jointly train their shift-reduce parser and sentence embedding components without gold trees. As their parser is not differentiable, they have to resort to reinforcement training, which increases variance, and the model may easily collapse to trivial left- or right-branching trees. Gumbel-Tree-LSTMs (Choi et al., 2018) construct trees by recursively selecting two terminal nodes to merge and learning the composition probability via downstream tasks. CRvNN (Chowdhury and Caragea, 2021) makes the entire process end-to-end differentiable and parallel by introducing a continuous relaxation. URNNG (Kim et al., 2019b) is the first architecture to jointly pretrain a parser and an encoder based on RNNG (Dyer et al., 2016). However, it has O(n³) complexity and remains unable to improve upon a right-branching baseline when punctuation is removed. Maillard et al. (2017) propose an alternative approach based on a differentiable CKY encoding. The algorithm is differentiable due to a soft-gating approach, which approximates discrete candidate selection by a probabilistic mixture of the constituents available in a given cell of the chart. While their work relies on annotated downstream tasks to learn structures, Drozdov et al. (2019) propose a novel auto-encoder-like pretraining objective based on the inside-outside algorithm (Baker, 1979; Casacuberta, 1994).

Conclusion
In this paper, we have presented Fast-R2D2, which improves the performance and inference speed of R2D2 by introducing a fast top-down parser to guide the pruning of the R2D2 encoder. Pretrained on the same corpus, Fast-R2D2 significantly outperforms sequential Transformers with a similar scale of parameters on classification tasks. Experimental results show that Fast-R2D2 is a promising and feasible way to learn hierarchical text representations, which differs from layer-stacking models and can also generate interpretable intermediate representations. As future work, we are investigating how to leverage the intermediate representations in additional downstream tasks.

Limitations
Our approach has three major limitations. First, Fast-R2D2 has shortcomings with regard to cross-sentence tasks due to the lack of cross-attention between sentences. Second, Fast-R2D2 requires greater memory resources for pretraining compared to sequential Transformers. At each invocation, the composition function takes four inputs and runs on m candidates, which means the total number of calls to the MLP is 4mn. Hence, the pretraining time of Fast-R2D2 is about 3 to 4 times that of BERT with 12 layers. Finally, our model does not beat most of the baselines in grammar induction when trained on WSJ only. A side effect of the pruning strategy is that the chart table is actually sparse, which means not all tokens are reconstructed based on complete context. This issue can be alleviated by pretraining on a large corpus, which is what our method is designed for, and why we introduce the ability to parallelize the computation.

Acknowledgement
We would like to thank the Aliyun EFLOPS team for their substantial support in designing and providing a cutting-edge training platform to facilitate fast experimentation in this work. We would also like to thank the Zhixiaobao team for their support in applying our model to real applications.

Tree Examples
System: Tree
FAST-R2D2: pricing cycles to be sure are nothing new for plastics producers
GOLD: pricing cycles to be sure are nothing new for plastics producers
FAST-R2D2: we were all wonderful heroes last year says an executive at one of quantum ' s competitors
GOLD: we were all wonderful heroes last year says an executive at one of quantum 's competitors
FAST-R2D2: a quick turn ##around is crucial to quantum because its cash requirements remain heavy
GOLD: A quick turnaround is crucial to Quantum because its cash requirements remain heavy
FAST-R2D2: some analysts saw the payment as an effort also to di ##sp ##el takeover speculation
GOLD: Some analysts saw the payment as an effort also to dispel takeover speculation
FAST-R2D2: ford motor co . said it acquired 5 % of the shares in jaguar plc
GOLD: Ford Motor Co. said it acquired 5 % of the shares in Jaguar PLC

Figure 1: Chart data structure. There are two alternative ways of generating $T_{1,3}$: combining either $(T_{1,2}, T_{3,3})$ or $(T_{1,1}, T_{2,3})$.

Heuristic pruning. As shown in Figure 2, R2D2 starts to prune once all cells beneath height m have been encoded. The heuristic rules work as follows:

Figure 2: Example of chart pruning and encoding process. With R2D2's original heuristic pruning, cells to merge are selected according to local composition probabilities. For better model-based pruning, we propose selecting cells according to the merge order estimated by a top-down parser.

Figure 3: (a) A parsed tree obtained by sorting split scores ($v_i$). (b) A sampled tree obtained by adding Gumbel noise ($g_i$, shown as dark vertical bars).

Table 2: Recall of constituents and words. WD means word. Values with † are taken from