A Conditional Splitting Framework for Efficient Constituency Parsing

We introduce a generic seq2seq parsing framework that casts constituency parsing problems (syntactic and discourse parsing) into a series of conditional splitting decisions. Our parsing model estimates the conditional probability distribution of possible splitting points in a given text span and supports efficient top-down decoding, which is linear in the number of nodes. The conditional splitting formulation, together with efficient beam search inference, facilitates structural consistency without relying on expensive structured inference. Crucially, for discourse analysis we show that in our formulation, discourse segmentation can be framed as a special case of parsing, which allows us to perform discourse parsing without requiring segmentation as a prerequisite. Experiments show that our model achieves good results on the standard syntactic parsing tasks both with and without pre-trained representations and rivals state-of-the-art (SoTA) methods that are more computationally expensive than ours. In discourse parsing, our method outperforms SoTA by a good margin.


Introduction
A number of formalisms have been introduced to analyze natural language at different linguistic levels. This includes syntactic structures in the form of phrasal and dependency trees, semantic structures in the form of meaning representations (Banarescu et al., 2013; Artzi et al., 2013), and discourse structures with Rhetorical Structure Theory (RST) (Mann and Thompson, 1988) or Discourse-LTAG (Webber, 2004). Many of these formalisms have a constituency structure, where textual units (e.g., phrases, sentences) are organized into nested constituents. For example, Figure 1 shows examples of a phrase structure tree and a sentence-level discourse tree (RST) that respectively represent how the phrases and clauses are hierarchically organized into a constituency structure. Developing efficient and effective parsing solutions has always been a key focus in NLP. In this work, we consider both phrasal (syntactic) and discourse parsing.
In recent years, neural end-to-end parsing methods have outperformed traditional methods that use grammar, lexicon and hand-crafted features. These methods can be broadly categorized based on whether they employ a greedy transition-based, a globally optimized chart parsing or a greedy top-down algorithm. Transition-based parsers (Dyer et al., 2016; Cross and Huang, 2016; Liu and Zhang, 2017; Wang et al., 2017) generate trees autoregressively as a sequence of shift-reduce decisions. Though computationally attractive, the local decisions made at each step may propagate errors to subsequent steps due to exposure bias (Bengio et al., 2015). Moreover, there may be mismatches in shift and reduce steps, resulting in invalid trees.
Chart-based methods, on the other hand, train neural scoring functions to model the tree structure globally (Durrett and Klein, 2015; Gaddy et al., 2018; Kitaev and Klein, 2018; Zhang et al., 2020b; Joty et al., 2012, 2013). By utilizing dynamic programming, these methods can perform exact inference, combining the constituent scores to find the most probable tree. However, they are generally slow, with at least O(n^3) time complexity. Greedy top-down parsers find the split points recursively and have received much attention lately due to their efficiency, which is usually O(n^2) (Stern et al., 2017a; Shen et al., 2018; Lin et al., 2019; Nguyen et al., 2020). However, they still suffer from exposure bias, where one incorrect splitting step may affect subsequent steps.
Discourse parsing in RST requires an additional step, discourse segmentation, which involves breaking the text into contiguous clause-like units called Elementary Discourse Units or EDUs (Figure 1). Traditionally, segmentation has been considered separately and as a prerequisite step for the parsing task, which links the EDUs (and larger spans) into a discourse tree (Soricut and Marcu, 2003; Joty et al., 2012; Wang et al., 2017). In this way, errors in discourse segmentation can propagate to discourse parsing (Lin et al., 2019).
In this paper, we propose a generic top-down neural framework for constituency parsing that we validate on both syntactic and sentence-level discourse parsing. Our main contributions are:
• We cast the constituency parsing task into a series of conditional splitting decisions and use a seq2seq architecture to model the splitting decision at each decoding step. Our parsing model, which is an instance of a Pointer Network (Vinyals et al., 2015a), estimates the pointing score from a span to a splitting boundary point, representing the likelihood that the span will be split at that point into two child spans.
• The conditional probabilities of the splitting decisions are optimized using a cross entropy loss, and structural consistency is maintained through a global pointing mechanism. The training process can be fully parallelized without requiring structured inference as in (Shen et al., 2018; Gómez and Vilares, 2018; Nguyen et al., 2020).
• Our model enables efficient top-down decoding with O(n) running time like transition-based parsers, while also supporting a customized beam search to find the best tree by searching through a reasonable space of high-scoring trees. The beam-search inference, along with the structural consistency from the modeling, makes our approach competitive with existing structured chart methods for syntactic (Kitaev and Klein, 2018) and discourse parsing (Zhang et al., 2020b). Moreover, our parser does not rely on any handcrafted features (not even part-of-speech tags), which makes it more efficient and more flexible across domains and languages.
• For discourse analysis, we demonstrate that our method can effectively find the segments (EDUs) by simply performing one additional step in the top-down parsing process. In other words, our method can parse a text into the discourse tree without needing discourse segmentation as a prerequisite; instead, it produces the segments as a by-product. To the best of our knowledge, this is the first model that can perform segmentation and parsing in a single embedded framework.
In the experiments with English Penn Treebank, our model without pre-trained representations achieves 93.8 F1, outperforming all existing methods with similar time complexity. With pre-training, our model pushes the F1 score to 95.7, which is on par with the SoTA while supporting faster decoding with a speed of over 1,100 sentences per second (fastest so far). Our model also performs competitively with SoTA methods on the multilingual parsing tasks in the SPMRL 2013/2014 shared tasks. In discourse parsing, our method establishes a new SoTA in end-to-end sentence-level parsing performance on the RST Discourse Treebank with an F1 score of 78.82.
We make our code available at https://ntunlpsg.github.io/project/conditionconstituency-style-parser/

Parsing as a Splitting Problem

Constituency parsing (both syntactic and discourse) can be considered as the problem of finding a set of labeled spans over the input text (Stern et al., 2017a). Let S(T) denote the set of labeled spans for a parse tree T, which can formally be expressed as (excluding the trivial singleton span layer):

S(T) = {((i_t, j_t), l_t) : t = 1, ..., |S(T)|}

where l_t is the label of the text span (i_t, j_t) encompassing tokens from index i_t to index j_t. Previous approaches to syntactic parsing (Stern et al., 2017a; Kitaev and Klein, 2018; Nguyen et al., 2020) train a neural model to score each possible span and then apply a greedy or dynamic programming algorithm to find the parse tree. In other words, these methods follow a span-based formulation.
In contrast, we formulate constituency parsing as the problem of finding the splitting points in a recursive, top-down manner. For each parent node in a tree that spans over (i, j), our parsing model is trained to point to the boundary between positions k and k + 1 to split the parent span into two child spans (i, k) and (k + 1, j). This is done through the Pointing mechanism (Vinyals et al., 2015a), where each splitting decision is modeled as a multinomial distribution over the input elements, which in our case are the token boundaries.
The correspondence between token- and boundary-based representations of a tree is straightforward. After including the start (<sos>) and end (<eos>) tokens, the token-based span (i, j) is equivalent to the boundary-based span (i − 1, j), and the boundary between the i-th and (i+1)-th tokens is indexed as i. For example, the (boundary-based) span "enjoys playing tennis" in Figure 1 is defined as (1, 4). Similarly, the boundary between the tokens "enjoys" and "playing" is indexed with 2.

Figure 1: A syntactic tree at the left and a discourse tree (DT) at the right; both have a constituency structure. The internal nodes in the discourse tree (Elaboration, Same-Unit) represent coherence relations and the edge labels indicate the nuclearity statuses ('N' for Nucleus and 'S' for Satellite) of the child spans. Below the trees, we show the labeled span and splitting representations. The bold splits in the DT representation (C(DT)) indicate the end of further splitting into smaller spans (i.e., they are EDUs).

Following the common practice in syntactic parsing, we binarize the n-ary tree by introducing a dummy label ∅. We also collapse the nested labeled spans in unary chains into unique atomic labels, such as S-VP in Figure 1. Every span represents an internal node in the tree, which has a left and a right child. Therefore, we can represent each internal node by its split into left and right children. Based on this, we define the set of splitting decisions C(T) for a syntactic tree T as follows.
Proposition 1 A binary syntactic tree T of a sentence containing n tokens can be transformed into a set of splitting decisions C(T) = {((i, j) → k) : i < k < j} such that the parent span (i, j) is split into two child spans (i, k) and (k, j).
An example of the splitting representation of a tree is shown in Figure 1 (without the node labels). Note that our transformed representation has a one-to-one mapping with the tree, since each splitting decision corresponds to one and only one internal node in the tree. We follow a depth-first order of the decision sequence, which in our preliminary experiments showed more consistent performance than alternatives like breadth-first order.
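As a concrete sketch, the transformation of Proposition 1 can be written in a few lines; the nested-tuple tree encoding and the function name below are our own illustration, not from the paper:

```python
def tree_to_splits(tree):
    """Transform a binarized tree into splitting decisions C(T).

    A tree is a nested tuple whose leaves are 0-based token indices;
    token t covers the boundary-based span (t, t + 1). Each internal
    node spanning boundaries (i, j) yields a decision ((i, j), k),
    splitting it into children (i, k) and (k, j). Decisions are
    emitted in depth-first order, as in the paper.
    """
    def span(t):  # boundary-based span covered by a subtree
        if isinstance(t, int):
            return (t, t + 1)
        return (span(t[0])[0], span(t[1])[1])

    decisions = []

    def visit(t):
        if isinstance(t, int):
            return
        (i, k), (_, j) = span(t[0]), span(t[1])
        decisions.append(((i, j), k))
        visit(t[0])
        visit(t[1])

    visit(tree)
    return decisions

# A 5-token sentence binarized as (w0, ((w1, (w2, w3)), w4)):
splits = tree_to_splits((0, ((1, (2, 3)), 4)))
```

Note that a binary tree over n tokens always yields exactly n − 1 decisions, mirroring the one-to-one mapping noted above.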
Extension to End-to-End Discourse Parsing Note that in syntactic parsing, the split position must be within the span but not at its edge, that is, k must satisfy i < k < j for each boundary span (i, j). Otherwise, it will not produce valid sub-trees. In this case, we keep splitting until each span contains a single leaf token. However, for discourse trees, each leaf is an EDU -a clause-like unit that can contain one or multiple tokens.
Unlike previous studies, which assume discourse segmentation as a pre-processing step, we propose a unified formulation that treats segmentation as one additional step in the top-down parsing process. To accommodate this, we relax Proposition 1 as:

Proposition 2 A binary discourse tree DT of a text containing n tokens can be transformed into a set of splitting decisions C(DT) = {((i, j) → k) : i < k ≤ j} such that the parent span (i, j) is split into two child spans (i, k) and (k, j) for k < j, or marked as a terminal span or EDU for k = j (end of splitting the span further).
We illustrate this with the DT example in Figure 1. Each splitting decision in C(DT) represents either the splitting of the parent span into two child spans (when the splitting point is strictly within the span) or the end of any further splitting (when the splitting point is the right endpoint of the span). By making this simple relaxation, our formulation can not only generate the discourse tree (in the former case) but also find the discourse segments (EDUs) as a by-product (in the latter case).
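Under Proposition 2, the EDUs fall out of the decision set itself. A minimal sketch (our own helper, applied to a hypothetical decision set rather than the example in Figure 1):

```python
def edus_from_splits(decisions):
    """Recover terminal spans (EDUs) from relaxed splitting decisions.

    A decision ((i, j), k) with k == j marks (i, j) as an EDU;
    with i < k < j it splits (i, j) into (i, k) and (k, j).
    """
    return sorted(span for span, k in decisions if k == span[1])

# Hypothetical C(DT) for a 6-token text split into two EDUs:
edus = edus_from_splits([((0, 6), 3), ((0, 3), 3), ((3, 6), 6)])
```

This is the sense in which segmentation is a special case of parsing: no separate segmenter is needed, only the terminal splitting decisions.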

Seq2Seq Parsing Framework
Let C(T) and L(T) respectively denote the structure (in split representation) and labels of a tree T (syntactic or discourse) for a given text x. We can express the probability of the tree as:

P(T | x) = P(C(T), L(T) | x) = P(L(T) | C(T), x) · P(C(T) | x)

Figure 2: Our syntactic parser along with the decoding process for a given sentence. The input to the decoder at each step is the representation of the span to be split. We predict the splitting point using a biaffine function between the corresponding decoder state and the boundary-based encoder representations. A label classifier is used to assign labels to the left and right spans.
This factorization allows us to first infer the tree structure from the input text, and then find the corresponding labels. As discussed in the previous section, we consider structure prediction as a sequence of splitting decisions to generate the tree in a top-down manner. Specifically, at each decoding step t, the output y_t represents the splitting decision (i_t, j_t) → k_t and y_<t represents the previous splitting decisions. Thus, we can express the probability of the tree structure as:

P(C(T) | x) = ∏_t P(y_t | y_<t, x)

This can effectively be modeled within a Seq2Seq pointing framework as shown in Figure 2. At each step t, the decoder autoregressively predicts the split point k_t by conditioning on the current input span (i_t, j_t) and the previous splitting decisions ((i, j) → k)_<t. This conditional splitting formulation (the decision at step t depends on previous steps) can help our model find better trees compared to non-conditional top-down parsers (Stern et al., 2017a; Shen et al., 2018; Nguyen et al., 2020), thus bridging the gap between the global (but expensive) and the local (but efficient) models. The labels L(T) are modeled by a label classifier, as described in the next section.

Model Architecture
We now describe the components of our parsing model: the sentence encoder, the span representation, the pointing model and the labeling model.

Sentence Encoder
Given an input sequence of n tokens x = (x_1, ..., x_n), we first add <sos> and <eos> markers to the sequence. After that, each token t in the sequence is mapped into its dense vector representation e_t as

e_t = [e_t^char ; e_t^word]

where e_t^char and e_t^word are respectively the character and word embeddings of token t. Similar to (Kitaev and Klein, 2018; Nguyen et al., 2020), we use a character LSTM to compute the character embedding of a token. We experiment with both randomly initialized and pretrained token embeddings; when a pretrained embedding is used, the character embedding is replaced by the pretrained token embedding. The token representations are then passed to a 3-layer Bi-LSTM encoder to obtain their contextual representations. In our experiments, we find that even without POS tags, our model performs competitively with baselines that use them.

Boundary and Span Representations
To represent each boundary between positions k and k + 1, we use the fencepost representation (Cross and Huang, 2016; Stern et al., 2017a):

h_k = [f_k ; b_{k+1}]

where f_k and b_{k+1} are the forward and backward LSTM hidden vectors at positions k and k + 1, respectively. To represent the span (i, j), we compute a linear combination of the two endpoints:

h_{i,j} = w_1 h_i + w_2 h_j

This span representation is used as input to the decoder. Figure 3 shows the boundary-based span representations for our example.

Figure 3: Boundary and span representations for the sentence "She enjoys playing tennis ."; shown are the representation of the boundary at 1 and the representation of the boundary-based span (0, 5).

The Decoder Our model uses a unidirectional LSTM as the decoder. At each decoding step t, the decoder takes as input the representation h_{i,j} of the corresponding span (i, j) and its previous state d_{t−1} to generate the current state d_t, and then applies a biaffine function (Dozat and Manning, 2017) between d_t and all of the encoded boundary representations (h_0, h_1, ..., h_n):

s_t(k) = MLP_d(d_t)^T W_dh MLP_h(h_k) + w_h^T MLP_h(h_k)

where each MLP operation includes a linear transformation with LeakyReLU activation to transform d_t and h_k into equal-sized vectors, and W_dh ∈ R^{d×d} and w_h ∈ R^d are respectively the weight matrix and weight vector of the biaffine function. The biaffine scores are then passed through a softmax layer to obtain the pointing distribution a_t ∈ [0, 1]^n for the splitting decision. When decoding the tree during inference, at each step we only examine the 'valid' splitting points between i and j: for syntactic parsing, i < k < j, and for discourse parsing, i < k ≤ j.
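A numerical sketch of the fencepost boundary representation and the masked pointing step (using NumPy; dimensions are toy-sized and the LeakyReLU MLPs applied before the biaffine are omitted, so this illustrates the scoring shape rather than the full model):

```python
import numpy as np

def fenceposts(fwd, bwd):
    """Boundary vectors h_k = [f_k; b_{k+1}] from BiLSTM states.

    fwd, bwd: (n + 2, d) forward/backward states over <sos> x1..xn <eos>.
    Returns an (n + 1, 2d) matrix, one row per boundary 0..n.
    """
    return np.concatenate([fwd[:-1], bwd[1:]], axis=-1)

def pointing_distribution(d_t, H, W_dh, w_h, i, j, discourse=False):
    """Softmax over the valid split points of span (i, j).

    Biaffine scores s_k = h_k^T W_dh d_t + w_h^T h_k for every
    boundary, with invalid positions masked out: valid splits are
    i < k < j for syntax and i < k <= j for discourse.
    """
    scores = H @ W_dh @ d_t + H @ w_h
    mask = np.full_like(scores, -np.inf)
    mask[i + 1 : (j + 1 if discourse else j)] = 0.0
    z = scores + mask
    e = np.exp(z - z.max())
    return e / e.sum()
```

Masking before the softmax is what confines the pointing distribution to the current span while keeping all boundary scores on the same scale.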
Label Classifier For syntactic parsing, we perform the label assignment for a span (i, j) as:

P(l | (i, j)) = softmax( MLP_l(h_l)^T W_lr MLP_r(h_r) + W_l^T MLP_l(h_l) + W_r^T MLP_r(h_r) + b )

where each of MLP_l and MLP_r includes a linear transformation with LeakyReLU activation to transform the left and right span representations h_l and h_r into equal-sized vectors, W_lr ∈ R^{d×L×d}, W_l ∈ R^{d×L} and W_r ∈ R^{d×L} are the weights, and b is a bias vector, with L being the number of phrasal labels. For discourse parsing, we perform label assignment after every split decision, since the label here represents the relation between the child spans. Specifically, as we split a span (i, j) into two child spans (i, k) and (k, j), we determine the relation label as:

P(l | (i, k), (k, j)) = softmax( MLP_l(h_{i,k})^T W_lr MLP_r(h_{k,j}) + W_l^T MLP_l(h_{i,k}) + W_r^T MLP_r(h_{k,j}) + b )
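The label scorer described above can be sketched likewise (NumPy, toy dimensions; the MLPs and the exact choice of span representations fed to the left and right sides are simplified, so treat the helper as illustrative only):

```python
import numpy as np

def label_scores(h_left, h_right, W_lr, W_l, W_r, b):
    """Unnormalized biaffine label scores for a pair of child spans.

    h_left, h_right: (d,) span representations; W_lr: (d, L, d);
    W_l, W_r: (d, L); b: (L,). Returns (L,) scores over labels,
    to be passed through a softmax.
    """
    bilinear = np.einsum('d,dle,e->l', h_left, W_lr, h_right)
    return bilinear + h_left @ W_l + h_right @ W_r + b
```

The bilinear term scores the interaction of the two child spans, while the two linear terms score each child independently, matching the weight shapes stated in the text.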
Training Objective The total loss is simply the sum of the cross-entropy losses for predicting the structure (split decisions) and the labels:

L_total(θ) = L_split(θ_e, θ_d) + L_label(θ_e, θ_label)

where θ = {θ_e, θ_d, θ_label} denotes the overall model parameters, comprising the encoder parameters θ_e shared by all components, the splitting parameters θ_d and the labeling parameters θ_label.

Top-Down Beam-Search Inference
As mentioned, existing top-down syntactic parsers do not consider the decoding history and perform greedy inference. With our conditional splitting formulation, our method can not only model the splitting history but also enlarge the search space of high-scoring trees through beam search. At each step, our decoder points over all the encoded boundary representations, which ensures that the pointing scores are on the same scale and allows a fair comparison between the total scores of all candidate subtrees. With these uniform scores, we can apply beam search to infer the most probable tree. Specifically, the method generates the tree in depth-first order while maintaining the top-B (beam size) partial trees at each step. It terminates after exactly n − 1 steps, which matches the number of internal nodes in the tree. Because the beam size B is constant with regard to the sequence length, we can omit it in the Big-O notation. Each decoding step with beam search can therefore be parallelized (O(1) complexity) on GPUs, making our algorithm run in O(n) time, which is faster than most top-down methods. If we strictly use a CPU, our method runs in O(n^2), while chart-based parsers run in O(n^3). Algorithm 1 illustrates the syntactic tree inference procedure. We also provide a similar inference algorithm for discourse parsing in the Appendix.
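A compact sketch of the beam-search decoder for syntactic trees; `split_logprobs` stands in for the trained model's pointing distribution, and the interface below is our own illustration of the procedure rather than a transcription of Algorithm 1:

```python
import heapq

def beam_search_parse(n, split_logprobs, B=20):
    """Top-down beam search over splitting decisions.

    split_logprobs((i, j), history) returns {k: log-prob} over the
    valid splits i < k < j of span (i, j). Each beam item holds a
    (score, pending-spans stack, decisions) triple; a binary tree over
    n tokens needs exactly n - 1 decisions, so the search runs for
    n - 1 steps and returns the decisions of the best complete tree.
    """
    beams = [(0.0, [(0, n)], [])]
    for _ in range(n - 1):
        candidates = []
        for score, stack, decisions in beams:
            i, j = stack[-1]
            for k, lp in split_logprobs((i, j), decisions).items():
                # push the right child first so the left child is split
                # next; single-token spans need no further decision
                children = [s for s in ((k, j), (i, k)) if s[1] - s[0] > 1]
                candidates.append(
                    (score + lp, stack[:-1] + children, decisions + [((i, j), k)])
                )
        beams = heapq.nlargest(B, candidates, key=lambda item: item[0])
    return max(beams, key=lambda item: item[0])[2]

# A stub model that prefers splits near the middle of the span:
def middle_split(span, history):
    i, j = span
    return {k: -abs(k - (i + j) / 2) for k in range(i + 1, j)}
```

With B = 1 this reduces to greedy decoding; larger beams let a locally worse split win if it leads to a better-scoring complete tree.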

Experiment
Datasets and Metrics To show the effectiveness of our approach, we conduct experiments on both syntactic and sentence-level RST parsing tasks. We use the standard Wall Street Journal (WSJ) part of the Penn Treebank (PTB) (Marcus et al., 1993) for syntactic parsing and the RST Discourse Treebank (RST-DT) (Carlson et al., 2002) for discourse parsing. For syntactic parsing, we also experiment with the multilingual parsing tasks on seven different languages from the SPMRL 2013-2014 shared task (Seddah et al., 2013): Basque, French, German, Hungarian, Korean, Polish and Swedish.
For evaluation on syntactic parsing, we report the standard labeled precision (LP), labeled recall (LR), and labeled F1 computed by evalb. For evaluation on RST-DT, we report the standard span, nuclearity label, and relation label F1 scores, computed using the implementation of Lin et al. (2019).

English (PTB) Syntactic Parsing
Setup We follow the standard train/valid/test split, which uses Sections 2-21 for training, Section 22 for development and Section 23 for evaluation. This results in 39,832 sentences for training, 1,700 for development, and 2,416 for testing. For our model, we use an LSTM encoder-decoder framework with a 3-layer bidirectional encoder and a 3-layer unidirectional decoder. The word embedding size is 100 and the character embedding size is 50; the LSTM hidden size is 400. The hidden dimension in the MLP modules and the biaffine function for split-point prediction is 500. The beam width B is set to 20. We use the Adam optimizer (Kingma and Ba, 2015) with a batch size of 5,000 tokens and an initial learning rate of 0.002, which decays exponentially at a rate of 0.75 every 5k steps. Model selection for final evaluation is performed based on the labeled F1 score on the development set.
Results without Pre-training From the results shown in Table 1, we see that our model achieves an F1 of 93.77, the highest among top-down methods. Specifically, our parser outperforms Stern et al. (2017a) and Shen et al. (2018) by about 2 points in F1 and Nguyen et al. (2020) by ∼1 point. Notably, without beam search (beam width 1, i.e., greedy decoding), our model achieves an F1 of 93.40, which is still better than the other top-down methods. Our model also performs competitively with CKY-based methods (Kitaev and Klein, 2018; Zhang et al., 2020b; Wei et al., 2020; Zhou and Zhao, 2019), while running faster than them.
Moreover, Zhou and Zhao (2019) use external supervision (head information) from the dependency parsing task. Dependency parsing models, in fact, bear a strong resemblance to the pointing mechanism that our model employs (Ma et al., 2018). As such, integrating dependency parsing information into our model may also be beneficial. We leave this for future work.
Results with Pre-training Similar to Kitaev and Klein (2018) and Kitaev et al. (2019), we also evaluate our parser with BERT embeddings (Devlin et al., 2019). They fine-tune BERT-large-cased on the task, whereas we keep it frozen, which is already sufficient and makes training more efficient. As shown in Table 2, our model achieves an F1 of 95.7, which is on par with SoTA models, while running faster: our model runs in O(n) time, whereas CKY needs O(n^3). Comprehensive comparisons of parsing speed are presented later.

SPMRL Multilingual Syntactic Parsing
We use the identical hyper-parameters and optimizer setup as for English PTB. We follow the standard train/valid/test splits provided in the SPMRL datasets; details are reported in Table 3.

From the results in Table 4, we see that our model achieves the highest F1 on French, Hungarian and Korean, exceeding the best baseline by 0.06, 0.15 and 0.13, respectively. Our method also rivals existing SoTA methods on the other languages, even though some of them use predicted POS tags (Nguyen et al., 2020) or bigger models (75M parameters) (Kitaev and Klein, 2018). Meanwhile, our model is smaller (31M parameters), uses no extra information and runs 40% faster.

Discourse Parsing
Setup For discourse parsing, we follow the standard split from Lin et al. (2019), which has 7,321 sentence-level discourse trees for training and 951 for testing. We also randomly select 10% of the training set for validation. Model selection for testing is performed based on the F1 of relation labels on the validation set. We use the same model settings as in the constituency parsing experiments, with BERT as pretrained embeddings.

Results Table 5 compares the results on the discourse parsing tasks in two settings: (i) when the EDUs are given (gold segmentation) and (ii) end-to-end parsing. We see that our model outperforms the baselines in both parsing conditions, achieving SoTA. When gold segmentation is provided, our model outperforms the single-task training model of Lin et al. (2019) by 0.43%, 1.06% and 0.82% absolute in Span, Nuclearity and Relation, respectively. Our parser also surpasses their joint training model, which uses multi-task training (segmentation and parsing), with 0.61% and 0.4% absolute improvements in Nuclearity and Relation, respectively. For end-to-end parsing, compared to the best baseline (Lin et al., 2019), our model yields 0.27%, 0.67% and 1.30% absolute improvements in Span, Nuclearity and Relation, respectively. This demonstrates the effectiveness of our conditional splitting approach and of the end-to-end formulation of the discourse analysis task. The fact that our model improves on span identification indicates that our method also yields better EDU segmentation.

Parsing Speed Comparison
We compare parsing speed of different models in Table 6. We ran our models on both CPU (Intel Xeon W-2133) and GPU (Nvidia GTX 1080 Ti).
Syntactic Parsing The Berkeley Parser and ZPar are two representative non-neural parsers without access to GPUs. Stern et al. (2017a) employ max-margin training and perform top-down greedy decoding on CPUs. Meanwhile, Kitaev and Klein (2018), Zhou and Zhao (2019) and Wei et al. (2020) use a self-attention encoder and perform decoding using Cython for acceleration. Zhang et al. (2020b) perform CKY decoding on GPU. The parser proposed by Gómez and Vilares (2018) is also efficient as it treats parsing as a sequence labeling task; however, its parsing accuracy is much lower than that of the others (90.7 F1 in Table 1). We see that our parser is much more efficient than existing ones. It uses neural modules to perform splitting, which are optimized and parallelized with an efficient GPU implementation, and it can parse 1,127 sentences/second, faster than any existing parser. In fact, there is still room to improve our speed, e.g., by choosing a better architecture such as the Transformer, which needs O(1) sequential steps to encode a sentence compared to O(n) for the bi-LSTM encoder. Moreover, generating the tree by splitting the spans/nodes at the same tree level in parallel at each step could boost the speed further. We leave these extensions to future work.
Discourse Parsing For measuring discourse parsing speed, we follow the same setup as Lin et al. (2019) and evaluate the models on the same 100 sentences randomly selected from the test set, including the model loading time for all systems. Since SPADE and CODRA need to extract a handful of features, they are typically slower than the neural models that use pretrained embeddings. In addition, CODRA's DCRF parser has an O(n^3) inference time complexity. As shown, our parser is 4.7x faster than the fastest end-to-end parser of Lin et al. (2019), making it not only effective but also highly efficient. Even when run only on the CPU, our model is faster than all the other models running on GPU or CPU, thanks to the end-to-end formulation that does not need EDU segmentation beforehand.

End-to-end discourse parsing speed (segmenter + parser), from Table 6:
CODRA (Joty et al., 2015): 3.05 sents/s (1.0x)
SPADE (Soricut and Marcu, 2003): 4.90 sents/s (1.6x)
Lin et al. (2019): 28.96 sents/s (9.5x)
Our end-to-end parser (CPU): 59.03 sents/s (19.4x)
Our end-to-end parser (GPU): 135.85 sents/s (44.5x)

Related Work
With the recent popularity of neural architectures such as LSTMs (Hochreiter and Schmidhuber, 1997) and Transformers (Vaswani et al., 2017), various neural models have been proposed to encode input sentences and infer their constituency trees. To enforce structural consistency, such methods employ either a greedy transition-based (Dyer et al., 2016; Liu and Zhang, 2017), a globally optimized chart parsing (Gaddy et al., 2018; Kitaev and Klein, 2018), or a greedy top-down algorithm (Stern et al., 2017a; Shen et al., 2018). Meanwhile, researchers have also tried to cast the parsing problem into tasks that can be solved differently. For example, Gómez and Vilares (2018) and Shen et al. (2018) proposed to map the syntactic tree of a sentence containing n tokens into a sequence of n − 1 labels or scalars. However, parsers of this type suffer from exposure bias during inference. Besides these methods, Seq2Seq models have been used to generate a linearized form of the tree (Vinyals et al., 2015b; Kamigaito et al., 2017; Suzuki et al., 2018; Fernández-González and Gómez-Rodríguez, 2020a). However, these methods may generate invalid trees when the opening and closing brackets do not match.
In discourse parsing, existing parsers receive the EDUs from a segmenter to build the discourse tree, which makes them susceptible to errors when the segmenter produces incorrect EDUs (Joty et al., 2012, 2015; Lin et al., 2019; Zhang et al., 2020a; Liu et al., 2020). There have also been attempts to model constituency and discourse parsing jointly (Zhao and Huang, 2017) without EDU preprocessing, based on the finding that each EDU generally corresponds to a constituent in the constituency tree, i.e., discourse structure usually aligns with constituency structure. However, this approach has the drawback that it requires building a joint syntacto-discourse dataset for training, which is not easily adaptable to new languages and domains.
Our approach differs from previous methods in that it represents the constituency structure as a series of splitting representations, and uses a Seq2Seq framework to model the splitting decision at each step. By enabling beam search, our model can find the best trees without the need to perform an expensive global search. We also unify discourse segmentation and parsing into one system by generalizing our model, which has been done for the first time to the best of our knowledge.
Our splitting mechanism shares some similarities with Pointer Networks (Vinyals et al., 2015a; Ma et al., 2018; Fernández-González and Gómez-Rodríguez, 2019, 2020b) and head-selection approaches (Kurita and Søgaard, 2019), but is distinct in that at each decoding step, our method identifies the splitting point of a span and generates a new input for future steps, instead of pointing in order to generate the next decoder input.

Conclusion
We have presented a novel, generic method for constituency parsing based on a Seq2Seq framework. Our method supports an efficient top-down decoding algorithm that uses a pointing function for scoring possible splitting points. The pointing mechanism captures global structural properties of a tree and allows efficient training with a cross-entropy loss. Our formulation, when applied to discourse parsing, can bypass discourse segmentation as a prerequisite step. Through experiments we have shown that our method outperforms all existing top-down methods on the English Penn Treebank and on sentence-level parsing of the RST Discourse Treebank. With pre-trained representations, our method rivals state-of-the-art methods while being faster. Our model also establishes a new state-of-the-art for sentence-level RST parsing.