The Limitations of Limited Context for Constituency Parsing

Incorporating syntax into neural approaches in NLP has a multitude of practical and scientific benefits. For instance, a language model that is syntax-aware is likely to be able to produce better samples; even a discriminative model like BERT with a syntax module could be used for core NLP tasks like unsupervised syntactic parsing. Rapid progress in recent years was arguably spurred on by the empirical success of the Parsing-Reading-Predict architecture of Shen et al. (2018a), later simplified by the Ordered Neuron LSTM of Shen et al. (2019). Most notably, this is the first time neural approaches were able to successfully perform unsupervised syntactic parsing (evaluated by various metrics like F1 score). However, even heuristic (much less fully mathematical) understanding of why and when these architectures work is lagging severely behind. In this work, we answer representational questions raised by the architectures in (Shen et al., 2018a, 2019), as well as some transition-based syntax-aware language models (Dyer et al., 2016): what kind of syntactic structure can current neural approaches to syntax represent? Concretely, we ground this question in the sandbox of probabilistic context-free grammars (PCFGs), and identify a key aspect of the representational power of these approaches: the amount and directionality of context that the predictor has access to when forced to make a parsing decision. We show that with limited context (either bounded or unidirectional), there are PCFGs for which these approaches cannot represent the max-likelihood parse; conversely, if the context is unlimited, they can represent the max-likelihood parse of any PCFG.

Introduction
Neural approaches have been steadily making their way to NLP in recent years. By and large, however, the neural techniques that have been scaled up the most and receive widespread usage do not explicitly try to encode discrete structure that is natural to language, e.g. syntax. The reason for this is perhaps not surprising: neural models have largely achieved substantial improvements in unsupervised settings, BERT (Devlin et al., 2019) being the de facto method for unsupervised pre-training in most NLP settings. On the other hand, unsupervised syntactic tasks, e.g. unsupervised syntactic parsing, have long been known to be very difficult (Htut et al., 2018). However, since incorporating syntax has been shown to improve language modeling (Kim et al., 2019b) as well as natural language inference (Chen et al., 2017; Pang et al., 2019; He et al., 2020), syntactic parsing remains important even in the current era when large pre-trained models, like BERT (Devlin et al., 2019), are available.
Arguably, the breakthrough works in unsupervised constituency parsing in a neural manner were (Shen et al., 2018a, 2019), achieving F1 scores 42.8 and 49.4 on the WSJ Penn Treebank dataset (Htut et al., 2018; Shen et al., 2019). Both of these architectures, however (especially Shen et al., 2018a), are quite intricate, and it's difficult to evaluate what their representational power is (i.e. what kinds of structure they can recover). Moreover, as subsequent more thorough evaluations show (Kim et al., 2019b,a), these methods still have a rather large performance gap with the oracle binary tree (which is the best binary parse tree according to F1 score), raising the question of what is missing in these methods.
We theoretically answer both questions raised in the prior paragraph. We quantify the representational power of two major frameworks in neural approaches to syntax: learning a syntactic distance (Shen et al., 2018a,b, 2019) and learning to parse through sequential transitions (Dyer et al., 2016; Chelba, 1997). To formalize our results, we consider the well-established sandbox of probabilistic context-free grammars (PCFGs). Namely, we ask: When is a neural model based on a syntactic distance or transitions able to represent the max-likelihood parse of a sentence generated from a PCFG?
We focus on a crucial "hyperparameter" common to practical implementations of both families of methods that turns out to govern the representational power: the amount and type of context the model is allowed to use when making its predictions. Briefly, for every position $t$ in the sentence, syntactic distance models learn a distance $d_t$ to the previous token, and the tree is then inferred from this distance; transition-based models iteratively construct the parse tree by deciding, at each position $t$, what operations to perform on a partial parse up to token $t$. A salient feature of both is the context, that is, which tokens is $d_t$ a function of (correspondingly, which tokens can the choice of operations at token $t$ depend on)?
We show that when the context is either bounded (that is, $d_t$ only depends on a bounded window around the $t$-th token) or unidirectional (that is, $d_t$ only considers the tokens to the left of the $t$-th token), there are PCFGs for which no distance metric (correspondingly, no algorithm to choose the sequence of transitions) works. On the other hand, if the context is unbounded in both directions then both methods work: that is, for any parse, we can design a distance metric (correspondingly, a sequence of transitions) that recovers it. This is of considerable importance: in practical implementations the context is either bounded (e.g. in Shen et al., 2018a, the distance metric is parametrized by a convolutional kernel with a constant width) or unidirectional (e.g. in Shen et al., 2019, the distance metric is computed by an LSTM, which performs a left-to-right computation).
This formally confirms a conjecture of Htut et al. (2018), who suggested that because these models commit to parsing decisions in a left-to-right fashion and are trained as a part of a language model, it may be difficult for them to capture sufficiently complex syntactic dependencies. Our techniques are fairly generic and seem amenable to analyzing other approaches to syntax. Finally, while the existence of a particular PCFG that is problematic for these methods doesn't necessarily imply that the difficulties will carry over to real-life data, the PCFGs that are used in our proofs closely track linguistic intuitions about syntactic structures that are difficult to infer: the parse depends on words that come much later in the sentence.

Overview of Results
We consider several neural architectures that have shown success in various syntactic tasks, most notably unsupervised constituency parsing and syntax-aware language modeling. The general framework these architectures fall under is as follows: to parse a sentence $W = w_1 w_2 \dots w_n$ with a trained neural model, the sentence $W$ is input into the model, which outputs $o_t$ at each step $t$, and finally all the outputs $\{o_t\}_{t=1}^{n}$ are utilized to produce the parse.
Given unbounded time and space resources, by a seminal result of Siegelmann and Sontag (1992), an RNN implementation of this framework is Turing complete. In practice it is common to restrict the form of the output $o_t$ in some way. In this paper, we consider the two most common approaches, in which $o_t$ is a real number representing a syntactic distance (Section 2.1) (Shen et al., 2018a,b, 2019) or a sequence of parsing operations (Section 2.2) (Chelba, 1997; Chelba and Jelinek, 2000; Dyer et al., 2016). We proceed to describe our results for each architecture in turn.

Syntactic distance
Syntactic distance-based neural parsers train a neural network to learn a distance for each pair of adjacent words, depending on the context surrounding the pair of words under consideration. The distances are then used to induce a tree structure (Shen et al., 2018a,b).
For a sentence $W = w_1 w_2 \dots w_n$, the syntactic distance between $w_{t-1}$ and $w_t$ ($2 \le t \le n$) is defined as $d_t = d(w_{t-1}, w_t \mid c_t)$, where $c_t$ is the context that $d_t$ takes into consideration. We will show that restricting the surrounding context either in directionality or in size results in poor representational power, while full context confers essentially perfect representational power with respect to PCFGs.
Concretely, if the context is full, we show:
Theorem (Informal, full context). For a sentence $W$ generated by any PCFG, if the computation of $d_t$ has as context the full sentence and the position index under consideration, i.e. $c_t = (W, t)$ and $d_t = d(w_{t-1}, w_t \mid c_t)$, then $d_t$ can induce the maximum likelihood parse of $W$.
On the flipside, if the context is unidirectional (i.e. unbounded left-context from the start of the sentence, and even possibly with a bounded look-ahead), the representational power becomes severely impoverished:
Theorem (Informal, limitation of left-to-right parsing via syntactic distance). There exists a PCFG $G$ such that for any distance measure $d_t$ whose computation incorporates only bounded context in at least one direction (left or right), e.g. $c_t = (w_0, w_1, \dots, w_{t+L})$, the probability that $d_t$ induces the max likelihood parse is arbitrarily low.
In practice, for computational efficiency, parametrizations of syntactic distances fall into the above assumptions of restricted context (Shen et al., 2018a). This puts the ability of these models to learn a complex PCFG syntax into considerable doubt. For formal definitions, see Section 4.2. For formal theorem statements and proofs, see Section 5.
Subsequently we consider ON-LSTM, an architecture proposed by Shen et al. (2019) improving their previous work (Shen et al., 2018a). It is also based on learning a syntactic distance, but in (Shen et al., 2019) the distances are derived from the values of a carefully structured master forget gate (see Section 6). While we show ON-LSTM can in principle losslessly represent any parse tree (Theorem 3), calculating the gate values in a left-to-right fashion (as is done in practice) is subject to the same limitations as the syntactic distance approach:
Theorem (Informal, limitation of syntactic distance estimation based on ON-LSTM). There exists a PCFG $G$ for which the probability that the syntactic distance converted from an ON-LSTM induces the max likelihood parse is arbitrarily low.
For a formal statement, see Section 6 and in particular Theorem 4.

Transition-based parsing
In principle, the output $o_t$ at each position $t$ of a left-to-right neural model for syntactic parsing need not be restricted to a real-numbered distance or a carefully structured vector. It can also be a combinatorial structure, e.g. a sequence of transitions (Chelba, 1997; Chelba and Jelinek, 2000; Dyer et al., 2016). We adopt a simplification of the neural parameterization in (Dyer et al., 2016) (see Definition 4.7).
With full context, Dyer et al. (2016) describe an algorithm to find a sequence of transitions to represent any parse tree, via a "depth-first, left-to-right traversal" of the tree. On the other hand, without full context, we prove that transition-based parsing suffers from the same limitations:
Theorem (Informal, limitation of transition-based parsing without full context). There exists a PCFG $G$ such that for any learned transition-based parser with bounded context in at least one direction (left or right), the probability that it returns the max likelihood parse is arbitrarily low.
For a formal statement, see Section 7, and in particular Theorem 5.
Remark. There is no immediate connection between the syntactic distance-based approaches (including ON-LSTM) and the transition-based parsing framework, so the limitations of transition-based parsing do not directly imply the stated negative results for syntactic distance or ON-LSTM, and vice versa.

The counterexample family
Most of our theorems proving limitations on bounded and unidirectional context are based on a PCFG family (Definition 2.1) which draws inspiration from a natural-language phenomenon already suggested in (Htut et al., 2018): later words in a sentence can force different syntactic structures earlier in the sentence. For example, consider the two sentences: "I drink coffee with milk." and "I drink coffee with friends." Their only difference occurs at their very last words, but their parses differ at some earlier words in each sentence, too, as shown in Figure 1.
To formalize this intuition, we define the following PCFG.
Definition 2.1 (Right-influenced PCFG). Let $m \ge 2$, $L \ge 1$ be positive integers. The grammar $G_{m,L}$ has starting symbol $S$, other non-terminals and terminals $a_i$ for all $i \in \{1, 2, \dots, m + 1 + L\}$, $c_j$ for all $j \in \{1, 2, \dots, m\}$.
Figure 1: The parse trees of the two sentences: "I drink coffee with milk." and "I drink coffee with friends.". Their only difference occurs at their very last words, but their parses differ at some earlier words in each sentence.
Figure 2: The structure of the parse tree of string $l_k = a_1 a_2 \dots a_{m+1+L} c_k \in L(G_{m,L})$. Note that any $l_{k_1}$ and $l_{k_2}$ are almost the same except for the last token: the prefix $a_1 a_2 \dots a_{m+1+L}$ is shared among all strings in $L(G_{m,L})$. However, their parses differ with respect to where $A_k$ is split. The last token $c_k$ is unique to $l_k$ and hence determines the correct parse according to $G_{m,L}$.
The rules of the grammar are in which → * means that the left expands into the right through a sequence of rules that conform to the requirements of the Chomsky normal form (CNF, Definition 4.4). Hence the grammar G m,L is in CNF.
The language of this grammar is $L(G_{m,L}) = \{ l_k = a_1 a_2 \dots a_{m+1+L} c_k : 1 \le k \le m \}$. The parse of an arbitrary $l_k$ is shown in Figure 2. Each $l_k$ corresponds to a unique parse determined by the choice of $k$. The structure of this PCFG is such that the parsing algorithms we consider, which proceed in a "left-to-right" fashion on $l_k$, cannot infer the syntactic structure of $a_1 a_2 \dots a_{m+1}$ before processing the last token $c_k$ any better than by randomly guessing one of the $m$ possibilities. This is the main intuition behind Theorems 2 and 5.
Remark. While our theorems focus on the limitation of "left-to-right" parsing, a symmetric argument implies the same limitation of "right-to-left" parsing. Thus, our claim is that unidirectional context (in either direction) limits the expressive power of parsing models.

Related Works
Neural models for parsing were first successfully implemented for supervised settings, e.g. (Vinyals et al., 2015; Dyer et al., 2016; Shen et al., 2018b). Unsupervised tasks remained seemingly out of reach until the proposal of the Parsing-Reading-Predict Network (PRPN) by Shen et al. (2018a), whose performance was thoroughly verified by extensive experiments in (Htut et al., 2018). The follow-up paper (Shen et al., 2019) introducing the ON-LSTM architecture radically simplified the architecture in (Shen et al., 2018a), while still ultimately attempting to fit a distance metric with the help of carefully designed master forget gates. Subsequent work by Kim et al. (2019a) departed from the usual way neural techniques are integrated in NLP, with great success: they proposed a neural parameterization for the EM algorithm for learning a PCFG, but in a manner that leverages semantic information as well, achieving a large improvement on unsupervised parsing tasks (footnote 2). In addition to constituency parsing, dependency parsing is another common task for syntactic parsing, but for our analyses on the ability of various approaches to represent the max-likelihood parse of sentences generated from PCFGs, we focus on the task of constituency parsing. Moreover, it's important to note that there is another line of work aiming to probe the ability of models trained without explicit syntactic consideration (e.g. BERT) to nevertheless discover some (rudimentary) syntactic elements (Bisk and Hockenmaier, 2015; Linzen et al., 2016; Choe and Charniak, 2016; Kuncoro et al., 2018; Williams et al., 2018; Goldberg, 2019; Htut et al., 2019; Hewitt and Manning, 2019; Reif et al., 2019). However, to date, we haven't been able to extract parse trees achieving scores that are close to the oracle binarized trees on standard benchmarks (Kim et al., 2019b,a).
Methodologically, our work is closely related to a long line of works aiming to characterize the representational power of neural models (e.g. RNNs, LSTMs) through the lens of formal languages and formal models of computation. Some of the works of this flavor are empirical in nature (e.g. LSTMs have been shown to possess stronger abilities to recognize some context-free languages and even some context-sensitive languages, compared with simple RNNs (Gers and Schmidhuber, 2001; Suzgun et al., 2019) or GRUs (Weiss et al., 2018; Suzgun et al., 2019)); some results are theoretical in nature (e.g. Siegelmann and Sontag (1992)'s proof that with unbounded precision and unbounded time complexity, RNNs are Turing-complete; related results investigate RNNs with bounded precision and computation time (Weiss et al., 2018), as well as memory (Merrill, 2019; Hewitt et al., 2020)). Our work contributes to this line of works, but focuses on the task of syntactic parsing instead.
Footnote 2: By virtue of not relying on bounded or unidirectional context, the Compound PCFG (Kim et al., 2019a) eschews the techniques in our paper. Specifically, by employing a bidirectional LSTM inference network in the process of constructing a tree given a sentence, the parsing is no longer "left-to-right".

Preliminaries
In this section, we define some basic concepts and introduce the architectures we will consider.

Probabilistic context-free grammar
First recall several definitions around formal languages, especially probabilistic context-free grammars:
Definition 4.1 (Probabilistic context-free grammar (PCFG)). Formally, a PCFG (Chomsky, 1956) is a 5-tuple $G = (\Sigma, N, S, R, \Pi)$ in which $\Sigma$ is the set of terminals, $N$ is the set of non-terminals, $S \in N$ is the start symbol, and $R$ is the set of production rules of the form $r = (r_L \to r_R)$, where $r_L \in N$ and $r_R$ is of the form $B_1 B_2 \dots B_m$ with $m \in \mathbb{Z}^+$ and $B_i \in (\Sigma \cup N)$ for all $i \in \{1, 2, \dots, m\}$. Finally, $\Pi : R \to [0, 1]$ is the rule probability function, in which for any $A \in N$, $\sum_{r \in R :\, r_L = A} \Pi(r) = 1$.
Definition 4.2 (Parse tree). Let $T_G$ denote the set of parse trees that $G$ can derive. Each $t \in T_G$ is associated with $\mathrm{yield}(t) \in \Sigma^*$, the sequence of terminals composed of the leaves of $t$, and $P_T(t) \in [0, 1]$, the probability of the parse tree, defined by the product of the probabilities of the rules in the derivation of $t$.
Definition 4.3 (Language and sentence). The language of $G$ is $L(G) = \{ s \in \Sigma^* : \exists t \in T_G, \mathrm{yield}(t) = s \}$. Each $s \in L(G)$ is called a sentence in $L(G)$, and is associated with the set of parses $T_G(s) = \{ t \in T_G \mid \mathrm{yield}(t) = s \}$, the set of max likelihood parses $\arg\max_{t \in T_G(s)} P_T(t)$, and its probability $P_S(s) = \sum_{t \in T_G(s)} P_T(t)$.
Definition 4.4 (Chomsky normal form (CNF)). A PCFG $G = (\Sigma, N, S, R, \Pi)$ is in CNF (Chomsky, 1959) if we require, in addition to Definition 4.1, that each rule $r \in R$ is in the form $A \to B_1 B_2$ where $B_1, B_2 \in N \setminus \{S\}$; $A \to a$ where $a \in \Sigma$, $a \neq \epsilon$; or $S \to \epsilon$, which is only allowed if the empty string $\epsilon \in L(G)$. Every PCFG $G$ can be converted into a PCFG $G'$ in CNF such that $L(G) = L(G')$ (Hopcroft et al., 2006).
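To make Definitions 4.1 and 4.2 concrete, the following is a minimal Python sketch of a PCFG in CNF together with the yield and probability of one parse tree. The grammar, its symbols, and its probabilities are hypothetical and chosen only for illustration; the tuple encoding of trees and the function names are likewise assumptions, not notation from this paper.

```python
# Hypothetical toy PCFG in CNF: (left-hand side, right-hand side) -> probability.
RULES = {
    ("S", ("A", "B")): 1.0,
    ("A", ("A", "A")): 0.4,
    ("A", ("a",)): 0.6,
    ("B", ("b",)): 1.0,
}


def check_normalized(rules):
    """Check that the rule probabilities sum to 1 for each left-hand side."""
    totals = {}
    for (lhs, _), p in rules.items():
        totals[lhs] = totals.get(lhs, 0.0) + p
    return all(abs(total - 1.0) < 1e-9 for total in totals.values())


# A tree is a tuple (label, child, ...); a child is a tree or a terminal string.
def tree_yield(tree):
    """Sequence of terminals at the leaves, read left to right."""
    out = []
    for child in tree[1:]:
        out.extend([child] if isinstance(child, str) else tree_yield(child))
    return out


def tree_prob(tree, rules):
    """P_T(t): product of the probabilities of the rules used in the tree."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rules[(label, rhs)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c, rules)
    return p


t = ("S", ("A", ("A", "a"), ("A", "a")), ("B", "b"))
print(tree_yield(t), tree_prob(t, RULES))
```

Here the tree uses the rules S → A B, A → A A, A → a (twice), and B → b, so its probability is the product 1.0 · 0.4 · 0.6 · 0.6 · 1.0.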

Syntactic distance
The Parsing-Reading-Predict Network (PRPN) (Shen et al., 2018a) is one of the leading approaches to unsupervised constituency parsing. The parsing network (which computes the parse tree, hence the only part we focus on in our paper) is a convolutional network that computes the syntactic distances $d_t = d(w_{t-1}, w_t)$ (defined in Section 2.1) based on the past $L$ words. A deterministic greedy tree induction algorithm is then used to produce a parse tree as follows. First, we split the sentence $w_1 \dots w_n$ into two constituents, $w_1 \dots w_{t-1}$ and $w_t \dots w_n$, where $t \in \arg\max \{d_t\}_{t=2}^{n}$; these form the left and right subtrees of the split at $t$. We recursively repeat this procedure for the newly created constituents. An algorithmic form of this procedure is included as Algorithm 1 in Appendix A.
Note that, due to the deterministic nature of the tree-induction process, the ability of PRPN to learn a PCFG is completely contingent upon learning a good syntactic distance.
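The greedy top-down induction described above can be sketched in a few lines of Python. This is an illustrative sketch rather than a transcription of Algorithm 1: the function name, the dictionary encoding of distances (keyed by the 1-indexed position $t$), the leftmost tie-breaking rule, and the example distances are all assumptions made here.

```python
def induce_tree(words, dist):
    """Greedy top-down tree induction from syntactic distances.

    words: tokens w_1..w_n.
    dist:  dict mapping t in 2..n (1-indexed) to d_t, the distance
           between w_{t-1} and w_t.
    Returns a nested tuple of binary splits with tokens at the leaves.
    """
    def build(lo, hi):  # constituent w_lo..w_hi, 1-indexed and inclusive
        if lo == hi:
            return words[lo - 1]
        # Split where the distance is largest; max() keeps the leftmost
        # position on ties, which is an arbitrary choice made here.
        t = max(range(lo + 1, hi + 1), key=lambda i: dist[i])
        return (build(lo, t - 1), build(t, hi))

    return build(1, len(words))


words = ["I", "drink", "coffee", "with", "milk"]
dist = {2: 4.0, 3: 3.0, 4: 2.0, 5: 1.0}  # made-up distances
print(induce_tree(words, dist))
```

With strictly decreasing distances, every split peels off the leftmost word, giving a fully right-branching tree.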

The ordered neuron architecture
Building upon the idea of representing the syntactic information with a real-valued distance measure at each position, a simple extension is to associate each position with a learned vector, and then use the vector for syntactic parsing. The ordered-neuron LSTM (ON-LSTM, Shen et al., 2019) proposes that the nodes that are closer to the root in the parse tree generate a longer span of terminals, and therefore should be less frequently "forgotten" than nodes that are farther away from the root. The difference in the frequency of forgetting is captured by a carefully designed master forget gate vector $\tilde{f}$, as shown in Figure 3 (in Appendix B). Formally:
Definition 4.5 (Master forget gates, Shen et al., 2019). Given the input sentence $W = w_1 w_2 \dots w_n$ and a trained ON-LSTM, running the ON-LSTM on $W$ gives the master forget gates, which are a sequence of $D$-dimensional vectors $\{\tilde{f}_t\}_{t=1}^{n}$, in which at each position $t$, $\tilde{f}_t = \tilde{f}_t(w_1, \dots, w_t) \in [0, 1]^D$. Moreover, let $\tilde{f}_{t,j}$ represent the $j$-th dimension of $\tilde{f}_t$. The ON-LSTM architecture requires that $\tilde{f}_{t,1} = 0$, $\tilde{f}_{t,D} = 1$, and $\forall i < j$, $\tilde{f}_{t,i} \le \tilde{f}_{t,j}$.
When parsing a sentence, the real-valued master forget gate vector $\tilde{f}_t$ at each position $t$ is reduced to a single real number representing the syntactic distance $d_t$ at position $t$ (see (1)) (Shen et al., 2018a). The syntactic distances are then used to obtain a parse.

Transition-based parsing
In addition to outputting a single real-valued distance or a vector at each position $t$, a left-to-right model can also parse a sentence by outputting a sequence of "transitions" at each position $t$, an idea proposed in some traditional parsing approaches (Sagae and Lavie, 2005; Chelba, 1997; Chelba and Jelinek, 2000), and also in some more recent neural parameterizations (Dyer et al., 2016).
We introduce several items of notation:
• $z_i^t$: the $i$-th transition performed when reading in $w_t$, the $t$-th token of the sentence $W = w_1 w_2 \dots w_n$.
• $N_t$: the number of transitions performed between reading in the token $w_t$ and reading in the next token $w_{t+1}$.
• $Z_t$: the sequence of transitions after reading in the prefix $w_1 w_2 \dots w_t$ of the sentence.
• $Z$: the parse of the sentence $W$, i.e. $Z = Z_n$.
We base our analysis on the approach introduced in the parsing version of (Dyer et al., 2016), though that work additionally proposes a generator version.
Definition 4.6 (Transition-based parser). A transition-based parser uses a stack (initialized to empty) and an input buffer (initialized with the sentence $w_1 \dots w_n$). At each position $t$, based on a context $c_t$, the parser outputs a sequence of parsing transitions $\{z_i^t\}_{i=1}^{N_t}$, where each $z_i^t$ can be one of the transitions in Definition 4.7. The parsing stops when the stack contains one single constituent and the buffer is empty.
Definition 4.7 (Parser transitions, Dyer et al., 2016). A parsing transition can be one of the following three types:
• NT(X): pushes a non-terminal X onto the stack.
• SHIFT: removes the first terminal from the input buffer and pushes onto the stack.
• REDUCE: pops from the stack until an open non-terminal is encountered, then pops this non-terminal and assembles everything popped to form a new constituent, labels this new constituent using this non-terminal, and finally pushes this new constituent onto the stack.
In Appendix Section C, we provide an example of parsing the sentence "I drink coffee with milk" using the set of transitions given by Definition 4.7.
The different context specifications and the corresponding representational powers of the transition-based parser are discussed in Section 7.
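The stack-and-buffer mechanics of Definitions 4.6 and 4.7 can be illustrated with a short Python sketch that executes a given transition sequence. The encoding of transitions, the `Open` marker class, and the constituent labels (S, NP, VP, PP) are hypothetical choices made for this illustration; the transition sequence below is not claimed to match the worked example in Appendix C.

```python
class Open:
    """Marker for an open (not yet reduced) non-terminal on the stack."""

    def __init__(self, label):
        self.label = label


def run_transitions(tokens, transitions):
    """Execute NT(X) / SHIFT / REDUCE transitions and return the parse.

    transitions: list of "SHIFT", "REDUCE", or ("NT", X).
    Completed constituents are tuples (label, child, ...).
    """
    stack, buffer = [], list(tokens)
    for tr in transitions:
        if tr == "SHIFT":
            # Move the first terminal of the buffer onto the stack.
            stack.append(buffer.pop(0))
        elif tr == "REDUCE":
            # Pop until the most recent open non-terminal, then close it.
            children = []
            while not isinstance(stack[-1], Open):
                children.append(stack.pop())
            label = stack.pop().label
            stack.append((label, *reversed(children)))
        else:
            # ("NT", X): push an open non-terminal X.
            stack.append(Open(tr[1]))
    # Parsing stops with a single constituent and an empty buffer.
    assert len(stack) == 1 and not buffer
    return stack[0]


tokens = ["I", "drink", "coffee", "with", "milk"]
transitions = [
    ("NT", "S"), ("NT", "NP"), "SHIFT", "REDUCE",
    ("NT", "VP"), "SHIFT", ("NT", "NP"), "SHIFT",
    ("NT", "PP"), "SHIFT", "SHIFT",
    "REDUCE", "REDUCE", "REDUCE", "REDUCE",
]
parse = run_transitions(tokens, transitions)
print(parse)
```

In this sketch the whole transition sequence is supplied up front; a left-to-right parser would instead have to choose each group of transitions from the context available at position $t$, which is exactly the restriction analyzed in Section 7.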

Representational Power of Neural Syntactic Distance Methods
In this section we formalize the results on syntactic distance-based methods. Since the tree induction algorithm always generates a binary tree, we consider only PCFGs in Chomsky normal form (CNF) (Definition 4.4) so that the max likelihood parse of a sentence is also a binary tree structure. To formalize the notion of "representing" a PCFG, we introduce the following definition:
Definition 5.1 (Representing PCFG with syntactic distance). Let $G$ be any PCFG in Chomsky normal form. A syntactic distance function $d$ is said to be able to $p$-represent $G$ if for a set of sentences in $L(G)$ whose total probability is at least $p$, $d$ can correctly induce the tree structure of the max likelihood parse of these sentences without ambiguity.
Remark. Ambiguities could occur when, for example, there exists $t$ such that $d_t = d_{t+1}$. In this case, the tree induction algorithm would have to break ties when determining the local structure for $w_{t-1} w_t w_{t+1}$. We preclude this possibility in Definition 5.1.
In the least restrictive setting, the whole sentence $W$, as well as the position index $t$, can be taken into consideration when determining each $d_t$. We prove that under this setting, there is a syntactic distance measure that can represent any PCFG.
Theorem 1 (Full context). Let $c_t = (W, t)$. For each PCFG $G$ in Chomsky normal form, there exists a syntactic distance measure $d_t = d(w_{t-1}, w_t \mid c_t)$ that can 1-represent $G$.
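Proof. For any sentence $s = s_1 s_2 \dots s_n \in L(G)$, let $T$ be its max likelihood parse tree. Since $G$ is in Chomsky normal form, $T$ is a binary tree. We will describe an assignment of $\{d_t : 2 \le t \le n\}$ such that their order matches the level at which the branches split in $T$. Specifically, for all $t \in [2, n]$, let $a_t$ denote the lowest common ancestor of $s_{t-1}$ and $s_t$ in $T$. Let $d'_t$ denote the shortest distance between $a_t$ and the root of $T$. Finally, let $d_t = n - d'_t$. Within any constituent of $T$, the adjacent pair separated at the root of that constituent has a strictly smaller $d'_t$, and hence a strictly larger $d_t$, than every other adjacent pair inside the constituent, so the greedy tree induction splits each constituent at the correct position. As a result, $\{d_t : 2 \le t \le n\}$ induces $T$.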
Remark. Since any PCFG can be converted to Chomsky normal form (Hopcroft et al., 2006), Theorem 1 implies that given the whole sentence and the position index as the context, the syntactic distance has sufficient representational power to capture any PCFG. It does not state, however, that the whole sentence and the position are the minimal contextual information needed for representability nor does it address training (i.e. optimization) issues.
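The construction in the proof of Theorem 1 can be checked mechanically. The Python sketch below assigns to each adjacent pair the value $n$ minus the depth of its lowest common ancestor, and then re-runs a greedy induction on those distances. The nested-tuple tree encoding, the function names, and the two example bracketings are assumptions made for illustration (the bracketings are not claimed to reproduce Figure 1).

```python
def distances_from_tree(tree):
    """Return (words, dist) with dist[t] = n - depth(LCA(w_{t-1}, w_t)).

    tree: nested 2-tuples with token strings at the leaves.
    """
    words, lca_depth = [], {}

    def visit(node, depth):
        if isinstance(node, str):
            words.append(node)
            return
        left, right = node
        visit(left, depth + 1)
        # This node is the lowest common ancestor of the last leaf of
        # `left` and the first leaf of `right` (1-indexed position t).
        lca_depth[len(words) + 1] = depth
        visit(right, depth + 1)

    visit(tree, 0)
    n = len(words)
    return words, {t: n - h for t, h in lca_depth.items()}


def induce_tree(words, dist):
    """Greedy top-down induction: split each span at its largest distance."""
    def build(lo, hi):
        if lo == hi:
            return words[lo - 1]
        t = max(range(lo + 1, hi + 1), key=lambda i: dist[i])
        return (build(lo, t - 1), build(t, hi))

    return build(1, len(words))


milk = ("I", ("drink", ("coffee", ("with", "milk"))))
friends = ("I", (("drink", "coffee"), ("with", "friends")))
for tree in (milk, friends):
    words, dist = distances_from_tree(tree)
    print(dist, induce_tree(words, dist) == tree)
```

Note that `distances_from_tree` reads the whole tree, which corresponds to the full context $c_t = (W, t)$; nothing here suggests how such distances could be computed from a prefix alone.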
On the flipside, we show that restricting the context even mildly can considerably decrease the representational power. Namely, we show that if context is bounded even in a single direction (to the left or to the right), there are PCFGs on which any syntactic distance will perform poorly. (Note that in the implementation of Shen et al. (2018a), the context only considers a bounded window to the left.)
Theorem 2 (Limitation of left-to-right parsing via syntactic distance). Let $w_0 = S$ be the sentence start symbol. Let the context $c_t = (w_0, w_1, \dots, w_{t+L})$. For all $\epsilon > 0$, there exists a PCFG $G$ in Chomsky normal form, such that any syntactic distance measure $d_t = d(w_{t-1}, w_t \mid c_t)$ cannot $\epsilon$-represent $G$.
Proof. Let $m > 1/\epsilon$ be a positive integer. Consider the PCFG $G_{m,L}$ in Definition 2.1. For any $k \in [m]$, consider the string $l_k \in L(G_{m,L})$. Note that in the parse tree of $l_k$, the rule $S \to A_k B_k$ is applied. Hence, $a_k$ and $a_{k+1}$ are the unique pair of adjacent terminals in $a_1 a_2 \dots a_{m+1}$ whose lowest common ancestor is the closest to the root in the parse tree of $l_k$. Then, in order for the syntactic distance metric $d$ to induce the correct parse tree for $l_k$, the distance $d_{k+1}$ between $a_k$ and $a_{k+1}$ must be the unique maximum in $\{d_t : 2 \le t \le m + 1\}$.
However, $d$ is restricted to be in the form $d_t = d(w_{t-1}, w_t \mid w_0, w_1, \dots, w_{t+L})$.
Note that for all $1 \le k_1 < k_2 \le m$, the first $m + 1 + L$ tokens of $l_{k_1}$ and $l_{k_2}$ are the same, which implies that the inferred syntactic distances are the same for $l_{k_1}$ and $l_{k_2}$ at each position $t \le m + 1$. Thus, it is impossible for $d$ to induce the correct parse tree for both $l_{k_1}$ and $l_{k_2}$. Hence, $d$ is correct on at most one $l_k \in L(G_{m,L})$, which corresponds to probability at most $1/m < \epsilon$. Therefore, $d$ cannot $\epsilon$-represent $G_{m,L}$.
Remark. In the counterexample, there are only $m$ possible parse structures for the prefix $a_1 a_2 \dots a_{m+1}$. Hence, the proved fact that the probability of being correct is at most $1/m$ means that under the restrictions of unbounded look-back and bounded look-ahead, the distance cannot do better than random guessing for this grammar.
Remark. Theorem 2 formalizes the intuition discussed in (Htut et al., 2018) outlining an intrinsic limitation of only considering bounded context in one direction. Indeed, for the PCFG constructed in the proof, the failure is a function of the context, not of the fact that we are using a distance-based parser. Note that as a corollary of the above theorem, if there is no context ($c_t = \text{null}$) or the context is both bounded and unidirectional, i.e. $c_t = (w_{t-L}, w_{t-L+1}, \dots, w_{t-1}, w_t)$, then there is a PCFG that cannot be $\epsilon$-represented by any such $d$.
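The counting argument behind Theorem 2 can be illustrated numerically. The Python sketch below builds the strings $l_k = a_1 \dots a_{m+1+L} c_k$, feeds an arbitrary distance function only the context $(w_0, \dots, w_{t+L})$, and counts on how many $l_k$ the largest distance over the prefix $a_1 \dots a_{m+1}$ falls between $a_k$ and $a_{k+1}$. It uses only the shape of the strings and the required split position stated in the proof, not the grammar rules themselves; the token names, the `<S>` start symbol, and the CRC-based `toy_distance` are hypothetical stand-ins for a learned model.

```python
import zlib


def make_sentence(k, m, L):
    """l_k = a_1 ... a_{m+1+L} c_k as a list of tokens."""
    return [f"a{i}" for i in range(1, m + 2 + L)] + [f"c{k}"]


def context(sentence, t, L):
    """c_t = (w_0, w_1, ..., w_{t+L}); w_0 is a start symbol, t is 1-indexed."""
    return ("<S>",) + tuple(sentence[: t + L])


def toy_distance(w_prev, w_cur, ctx):
    """Arbitrary deterministic stand-in for a learned d(w_{t-1}, w_t | c_t)."""
    return zlib.crc32(" ".join((w_prev, w_cur) + ctx).encode())


def predicted_split(sentence, m, L, distance=toy_distance):
    """Position t in 2..m+1 with the largest distance on a_1..a_{m+1}."""
    ds = {
        t: distance(sentence[t - 2], sentence[t - 1], context(sentence, t, L))
        for t in range(2, m + 2)
    }
    return max(ds, key=ds.get)


m, L = 8, 3
splits = {k: predicted_split(make_sentence(k, m, L), m, L) for k in range(1, m + 1)}
# l_k requires the top split of the prefix between a_k and a_{k+1}, i.e. at t = k + 1.
correct = [k for k, t in splits.items() if t == k + 1]
print(splits, correct)
```

Because the context at every position $t \le m + 1$ stops before the final token, the predicted split is identical for all $k$, so at most one of the $m$ equally likely sentences can be parsed correctly, whichever distance function is substituted for `toy_distance`.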

Representational Power of the Ordered Neuron Architecture
In this section, we formalize the results characterizing the representational power of the ON-LSTM architecture. The master forget gates of the ON-LSTM, $\{\tilde{f}_t\}_{t=2}^{n}$ in which each $\tilde{f}_t \in [0, 1]^D$, encode the hierarchical structure of a parse tree, and Shen et al. (2019) propose to carry out unsupervised constituency parsing via a reduction from the gate vectors to syntactic distances by setting: $\hat{d}^f_t = D - \sum_{j=1}^{D} \tilde{f}_{t,j}$ (1).
First we show that the gates in ON-LSTM in principle form a lossless representation of any parse tree.
Theorem 3 (Lossless representation of a parse tree). For any sentence $W = w_1 w_2 \dots w_n$ with parse tree $T$ in any PCFG in Chomsky normal form, there exists a dimensionality $D \in \mathbb{Z}^+$ and a sequence of vectors $\{\tilde{f}_t\}_{t=2}^{n}$ in which each $\tilde{f}_t \in [0, 1]^D$, such that the estimated syntactic distances via (1) induce the structure of $T$.
Proof. By Theorem 1, there is a syntactic distance measure $\{d_t\}_{t=2}^{n}$ that induces the structure of $T$ (such that $\forall t$, $d_t \neq d_{t+1}$).
For each $t = 2, \dots, n$, set $\tilde{d}_t = k$ if $d_t$ is the $k$-th smallest entry in $\{d_t\}_{t=2}^{n}$, breaking ties arbitrarily. Then, each $\tilde{d}_t \in [1, n - 1]$, and $\{\tilde{d}_t\}_{t=2}^{n}$ also induces the structure of $T$.
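One way to turn the ranks above into gate vectors is sketched below in Python. It assumes that the reduction (1) has the form "dimension $D$ minus the sum of the gate entries", and it chooses $D = n$ with binary gates whose first $\tilde{d}_t$ entries are 0 and remaining entries are 1; both the assumed form of (1) and this particular choice of $D$ are assumptions of the sketch, not statements taken from the proof.

```python
def ranks(dist):
    """Map each position t to k if d_t is the k-th smallest distance."""
    order = sorted(dist, key=lambda t: dist[t])  # ties keep insertion order
    return {t: r for r, t in enumerate(order, start=1)}


def gates_from_ranks(rank, D):
    """Binary, non-decreasing gate vectors: `r` zeros followed by D - r ones."""
    return {t: [0.0] * r + [1.0] * (D - r) for t, r in rank.items()}


def distance_from_gate(gate):
    """Assumed form of the reduction (1): D minus the sum of the gate entries."""
    return len(gate) - sum(gate)


dist = {2: 5, 3: 4, 4: 3, 5: 2}  # distances for a 5-token sentence
rank = ranks(dist)
D = len(dist) + 1  # D = n, so every gate vector ends with at least one 1
gates = gates_from_ranks(rank, D)
recovered = {t: distance_from_gate(g) for t, g in gates.items()}
print(rank, recovered)
```

Each gate vector starts at 0, ends at 1, and is non-decreasing, matching the constraints in Definition 4.5, and the recovered distances equal the ranks, so they induce the same tree.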
Although Theorem 3 shows the ability of the master forget gates to perfectly represent any parse tree, a left-to-right parser can be proved to be unable to return the correct parse with high probability. In the actual implementation in (Shen et al., 2019), the (real-valued) master forget gate vectors $\{\tilde{f}_t\}_{t=1}^{n}$ are produced by feeding the input sentence $W = w_1 w_2 \dots w_n$ to a model trained with a language modeling objective. In other words, $\tilde{f}_{t,j}$ is calculated as a function of $w_1, \dots, w_t$, rather than the entire sentence.
As such, this left-to-right parser is subject to similar limitations as in Theorem 2:
Theorem 4 (Limitation of syntactic distance estimation based on ON-LSTM). For any $\epsilon > 0$, there exists a PCFG $G$ in Chomsky normal form, such that the syntactic distance measure calculated with (1), $\hat{d}^f_t$, cannot $\epsilon$-represent $G$.
Proof. Since by Definition 4.5, $\tilde{f}_{t,j}$ is a function of $w_1, \dots, w_t$, the estimated syntactic distance $\hat{d}^f_t$ is also a function of $w_1, \dots, w_t$. By Theorem 2, even with unbounded look-back context $w_1, \dots, w_t$, there exists a PCFG for which the probability that $\hat{d}^f_t$ induces the correct parse is arbitrarily low.

Representational Power of Transition-Based Parsing
In this section, we analyze a transition-based parsing framework inspired by (Dyer et al., 2016; Chelba and Jelinek, 2000; Chelba, 1997). Again, we first observe that "full context" confers full representational power. Namely, using the terminology of Definition 4.6, we let the context $c_t$ at each position $t$ be the whole sentence $W$ and the position index $t$. Note that any parse tree can be generated by a sequence of transitions defined in Definition 4.7. Indeed, Dyer et al. (2016) describe an algorithm to find such a sequence of transitions via a "depth-first, left-to-right traversal" of the tree.
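To make the "depth-first, left-to-right traversal" concrete, the sketch below derives a transition sequence from a tree and replays it to recover the same tree. It assumes the action set NT(X), SHIFT, REDUCE of the discriminative parser in Dyer et al. (2016); the transitions of Definition 4.7 may differ in detail, and the tree encoding, helper names, and example sentence are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# A leaf is a token string; an internal node is (label, child_1, ..., child_m).
Tree = Union[str, Tuple]


@dataclass
class Open:
    """Stack marker for a nonterminal that has been opened but not yet reduced."""
    label: str


def oracle_transitions(tree: Tree) -> List[str]:
    """Depth-first, left-to-right traversal emitting NT(X), SHIFT and REDUCE."""
    if isinstance(tree, str):
        return ["SHIFT"]
    label, *children = tree
    actions = [f"NT({label})"]
    for child in children:
        actions.extend(oracle_transitions(child))
    actions.append("REDUCE")
    return actions


def replay(actions: List[str], tokens: List[str]) -> Tree:
    """Execute the transitions on a stack and return the resulting tree."""
    stack: list = []
    buffer = iter(tokens)
    for action in actions:
        if action == "SHIFT":
            stack.append(next(buffer))
        elif action.startswith("NT("):
            stack.append(Open(action[3:-1]))
        else:  # REDUCE: pop children down to the nearest open nonterminal
            children = []
            while not isinstance(stack[-1], Open):
                children.append(stack.pop())
            label = stack.pop().label
            stack.append((label, *reversed(children)))
    return stack[0]


# Hypothetical example tree, used only for illustration.
tree = ("S", ("NP", "the", "cat"), ("VP", "sleeps"))
tokens = ["the", "cat", "sleeps"]
actions = oracle_transitions(tree)
print(actions)
```

With the whole sentence available as context, a parser only needs to emit this sequence, which is why full context suffices in this framework.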
Proceeding to limited context, in the setting of typical left-to-right parsing, the context $c_t$ consists of all current and past tokens $\{w_j\}_{j=1}^{t}$ and all previous parses $\{(z^j_1, \dots, z^j_{N_j})\}_{j=1}^{t}$. We will again prove even stronger negative results, where we allow an optional look-ahead to $L$ input tokens to the right.
Theorem 5 (Limitation of transition-based parsing without full context). For any $\epsilon > 0$, there exists a PCFG $G$ in Chomsky normal form, such that for any learned transition-based parser (Definition 4.6) based on this context $c_t$, the sum of the probabilities of the sentences in $L(G)$ for which the parser returns the maximum likelihood parse is less than $\epsilon$.
Proof. Let $m > 1/\epsilon$ be a positive integer. Consider the PCFG $G_{m,L}$ in Definition 2.1.
Note that $\forall k$, $S \to A_k B_k$ is applied to yield string $l_k$. Then in the parse tree of $l_k$, $a_k$ and $a_{k+1}$ are the unique pair of adjacent terminals in $a_1 a_2 \dots a_{m+1}$ whose lowest common ancestor is the closest to the root. Thus, different $l_k$ require different sequences of transitions within the first $m + 1$ input tokens, i.e. $\{z^t_i\}_{i \ge 1,\, 1 \le t \le m+1}$. For each $w \in L(G_{m,L})$, before the last token $w_{m+2+L}$ is processed, based on the common prefix $w_1 w_2 \dots w_{m+1+L} = a_1 a_2 \dots a_{m+1+L}$, it is equally likely that $w = l_k$ for every $k$, with probability $1/m$ each.
Moreover, when processing $w_{m+1}$, the bounded look-ahead window of size $L$ does not allow access to the final input token $w_{m+2+L} = c_k$.
Thus, $\forall\, 1 \le k_1 < k_2 \le m$, it is impossible for the parser to return the correct parse tree for both $l_{k_1}$ and $l_{k_2}$ without ambiguity. Hence, the parse is correct on at most one $l_k \in L(G_{m,L})$, which corresponds to probability at most $1/m < \epsilon$.
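The counting step at the end of this proof can be illustrated with a small abstraction. The sketch below does not build $G_{m,L}$ itself. It only assumes, as the proof argues, that the $m$ sentences $l_1, \dots, l_m$ are equally likely, share the prefix visible to the parser, and each require a different decision within that prefix. The function name is hypothetical.

```python
from fractions import Fraction


def best_prefix_parser_accuracy(m: int) -> Fraction:
    """Best achievable probability of a correct parse when the decision depends only
    on a prefix shared by m equally likely sentences needing pairwise distinct decisions."""
    candidates = range(1, m + 1)  # index k of the sentence l_k, each with probability 1/m
    best = Fraction(0)
    for guess in candidates:  # the parser must commit to a single decision for the prefix
        accuracy = sum(Fraction(1, m) for k in candidates if k == guess)
        best = max(best, accuracy)
    return best


epsilon = Fraction(1, 10)
m = 11  # any integer m > 1 / epsilon works
print(best_prefix_parser_accuracy(m), best_prefix_parser_accuracy(m) < epsilon)
```

A randomized parser is a mixture of such deterministic choices, so under the same assumptions it cannot exceed $1/m$ either.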

Conclusion
In this work, we considered the representational power of two frameworks for constituency parsing prominent in the literature, based on learning a syntactic distance and on learning a sequence of iterative transitions to build the parse tree, in the sandbox of PCFGs. In particular, we show that if the context for calculating distances or deciding on transitions is limited on at least one side (which is typically the case in practice for existing architectures), there are PCFGs for which no good distance metric or sequence of transitions can be chosen to construct the maximum likelihood parse.
This limitation was already suspected in (Htut et al., 2018) as a potential failure mode of leading neural approaches like (Shen et al., 2018a, 2019), and we show formally that this is the case. The PCFGs with this property track the intuition that bounded-context methods will have issues when the parse at a certain position depends heavily on later parts of the sentence. The conclusions thus suggest re-focusing our attention on methods like (Kim et al., 2019a), which have enjoyed greater success on tasks like unsupervised constituency parsing and do not fall in the paradigm analyzed in our paper. A question of definite further interest is how to augment models that have been successfully scaled up (e.g. BERT) with syntactic information in a principled manner, such that they can capture syntactic structure (like PCFGs). The other question of immediate importance is to understand the interaction between the syntactic and semantic modules in neural architectures: information is shared between such modules in various successful architectures, e.g. (Dyer et al., 2016; Shen et al., 2018a, 2019; Kim et al., 2019a), and the relative pros and cons of doing this are not well understood. Finally, our paper focuses purely on representational power and does not consider algorithmic and statistical aspects of training. As any model architecture comes with its own optimization and generalization considerations, and natural language data necessitates modeling the interaction between syntax and semantics, those considerations are well beyond the scope of our analysis in the controlled sandbox of PCFGs, and are interesting directions for future work.

A Tree Induction Algorithm Based on Syntactic Distance
The following algorithm is proposed in (Shen et al., 2018a) to create a parse tree based on a given syntactic distance.

Algorithm 1: Tree induction based on syntactic distance
Data: Sentence $W = w_1 w_2 \dots w_n$, syntactic distances $d_t = d(w_{t-1}, w_t \mid c_t)$, $2 \le t \le n$
Result: A parse tree for $W$
Initialize the parse tree with a single node $n_0 = w_1 w_2 \dots w_n$;
while $\exists$ leaf node $n = w_i w_{i+1} \dots w_j$ where $i < j$ do
    Find $k \in \arg\max_{i+1 \le k \le j} d_k$;
    Create the left child $n_l$ and the right child $n_r$ of $n$;
    $n_l \leftarrow w_i w_{i+1} \dots w_{k-1}$;
    $n_r \leftarrow w_k w_{k+1} \dots w_j$;
end
return The parse tree rooted at $n_0$.
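Algorithm 1 can be sketched in a few lines of Python. This is an illustrative rendering rather than the implementation of Shen et al. (2018a): it uses recursion instead of the while loop (each leaf spanning more than one token is split exactly once either way), breaks ties in the arg max by taking the leftmost maximizer, and the function name and example distances are hypothetical.

```python
from typing import List, Tuple, Union

# A leaf is a token; an internal node is a (left, right) pair.
Tree = Union[str, Tuple["Tree", "Tree"]]


def induce_tree(words: List[str], d: List[float]) -> Tree:
    """Top-down tree induction from syntactic distances.

    With 0-based indexing, d[t] is the distance between words[t - 1] and words[t]
    for 1 <= t < len(words); d[0] is an unused placeholder.
    """
    assert len(d) == len(words)

    def build(i: int, j: int) -> Tree:
        if i == j:  # a single token is a finished leaf
            return words[i]
        # Split at the position with the largest distance (leftmost on ties).
        k = max(range(i + 1, j + 1), key=lambda t: d[t])
        return (build(i, k - 1), build(k, j))

    return build(0, len(words) - 1)


words = ["I", "drink", "coffee", "with", "milk"]
d = [0.0, 4.0, 3.0, 2.0, 1.0]  # hypothetical distances; d[0] is unused
print(induce_tree(words, d))
```

With these strictly decreasing distances, each split peels off the leftmost token, so the result is a right-branching tree.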

B ON-LSTM Intuition
See Figure 3 below, which is excerpted from (Shen et al., 2019) with minor adaptation to the notation.

Figure 3: Relationship between the parse tree, the block view, and the ON-LSTM. Excerpted from (Shen et al., 2019) with minor adaptation to the notation.

Table 1 below shows an example of parsing the sentence "I drink coffee with milk" using the set of transitions given by Definition 4.7, which employs the parsing framework of (Dyer et al., 2016). The parse tree of the sentence is given by