Unsupervised Discontinuous Constituency Parsing with Mildly Context-Sensitive Grammars

We study grammar induction with mildly context-sensitive grammars for unsupervised discontinuous parsing. Using the probabilistic linear context-free rewriting system (LCFRS) formalism, our approach fixes the rule structure in advance and focuses on parameter learning with maximum likelihood. To reduce the computational complexity of both parsing and parameter estimation, we restrict the grammar formalism to LCFRS-2 (i.e., binary LCFRS with fan-out two) and further discard rules that require O(l^6) time to parse, reducing inference to O(l^5). We find that using a large number of nonterminals is beneficial and thus make use of tensor decomposition-based rank-space dynamic programming with an embedding-based parameterization of rule probabilities to scale up the number of nonterminals. Experiments on German and Dutch show that our approach is able to induce linguistically meaningful trees with continuous and discontinuous structures.


Introduction
Unsupervised parsing aims to induce hierarchical linguistic structures given only the strings in a language. A classic approach to unsupervised parsing is through probabilistic grammar induction (Lari and Young, 1990), which learns a probabilistic grammar (i.e., a set of rewrite rules and their probabilities) from raw text. Recent work has shown that neural parameterizations of probabilistic context-free grammars (PCFG), wherein the grammar's rule probabilities are given by a neural network over shared symbol embeddings, can achieve promising results on unsupervised constituency parsing (Kim et al., 2019; Jin et al., 2019, 2021; Yang et al., 2021b, 2022). However, context-free rules are not natural for modeling discontinuous language phenomena such as extrapositions, cross-serial dependencies, and wh-movements. Mildly context-sensitive grammars (Joshi, 1985), which sit between context-free and context-sensitive grammars in the classic Chomsky-Schützenberger hierarchy (Chomsky, 1959; Chomsky and Schützenberger, 1963), are powerful enough to model richer aspects of natural language, including discontinuous and non-local phenomena. And despite their expressivity, they enjoy polynomial-time inference algorithms, making them attractive both as cognitively plausible models of human language processing and as targets for unsupervised learning. There are several weakly equivalent grammatical formalisms for generating mildly context-sensitive languages (Vijay-Shanker and Weir, 1994): tree adjoining grammars (Joshi, 1975), head grammars (Pollard, 1985), combinatory categorial grammars (Steedman, 1987), and linear indexed grammars (Gazdar, 1988). In this paper we work with linear context-free rewriting systems (LCFRS; Vijay-Shanker et al., 1987), which generalize the above formalisms and are weakly equivalent to multiple context-free grammars (Seki et al., 1991).
Derivation trees in an LCFRS directly correspond to discontinuous constituency trees where each node can dominate a non-contiguous sequence of words in the yield, as shown in Fig. 1.
We focus on the LCFRS formalism as it has previously been successfully employed for supervised discontinuous constituency parsing (Levy, 2005; Maier, 2010; van Cranenburgh et al., 2016). The complexity of parsing in an LCFRS is O(n^{3k} |G|), where n is the sentence length, k is the fan-out (the maximum number of contiguous blocks of text that can be dominated by a nonterminal), and |G| is the grammar size. While polynomial, this is too high to be practical for unsupervised learning on real-world data. We thus restrict ourselves to LCFRS-2, i.e., binary LCFRS with fan-out two, which has been shown to have high coverage on discontinuous treebanks (Maier et al., 2012). Even with this restriction, LCFRS-2 remains difficult to induce from raw text due to the O(n^6 |G|) dynamic program for parsing and marginalization. However, Corro (2020) observes that an O(n^5 |G|) variant of the grammar that discards certain rules can still recover 98% of real-world treebank constituents. Our approach uses this restricted variant of LCFRS-2 (see Sec 2.2). Finally, following recent work which finds that overparameterizing deep latent variable models is beneficial for unsupervised learning (Buhai et al., 2020; Yang et al., 2021b; Chiu and Rush, 2020; Chiu et al., 2021), we scale LCFRS-2 to a large number of nonterminals by adapting tensor-decomposition-based inference techniques, originally developed for PCFGs (Cohen et al., 2013; Yang et al., 2021b, 2022), to the LCFRS case. We conduct experiments on German and Dutch, both of which have frequent discontinuous and non-local language phenomena and have available discontinuous treebanks, and observe that our approach is able to induce grammars with nontrivial performance on discontinuous constituents.

Tensor decomposition-based PCFGs. Cohen et al. (2013) use canonical polyadic decomposition (CPD; Rabanser et al., 2017) to decompose the 3D binary rule probability tensor T ∈ R^{m×m×m} as T = Σ_{q=1}^{r} u_q ⊗ v_q ⊗ w_q, where u_q, v_q, w_q ∈ R^m, r is the tensor rank (a hyperparameter), and ⊗ is the outer product. Letting U, V, W ∈ R^{r×m} be the matrices that result from stacking all u_q, v_q, w_q, Cohen et al. (2013) give the following recursive formula for calculating the inside tensor α ∈ R^{(n+1)×(n+1)×m} for a sentence of length n:

α^L_{i,k} = V α_{i,k},   α^R_{k,j} = W α_{k,j},   α_{i,j} = U^⊤ Σ_{i<k<j} ( α^L_{i,k} ∘ α^R_{k,j} ).

Here α^L, α^R ∈ R^{(n+1)×(n+1)×r} are auxiliary tensors for storing intermediate values, and ∘ is the Hadamard product. The resulting complexity of this version of the inside algorithm is O(n^3 r + n^2 m r), which removes the cubic dependence on m.
Based on this formula, Yang et al. (2021b) propose a low-rank neural parameterization which uses a neural network over shared symbol embeddings to produce unnormalized score matrices Ū, V̄, W̄. Then Ū is softmax-ed across columns to obtain U, while V̄, W̄ are softmax-ed across rows to obtain V, W. The difference between Cohen et al. (2013) and Yang et al. (2021b) is that the former performs CPD on an existing probability tensor T for faster (supervised) parsing, whereas the latter directly parameterizes and learns U, V, W from data without ever instantiating T. Yang et al. (2022) build on Yang et al. (2021b) and further pre-compute the matrices J = V U^⊤ and K = W U^⊤ to rewrite the above recursive formula as

ᾱ_{i,j} = Σ_{i<k<j} ( J ᾱ_{i,k} ) ∘ ( K ᾱ_{k,j} ),

where ᾱ ∈ R^{(n+1)×(n+1)×r} is an auxiliary inside score tensor in the rank space. The resulting complexity of this approach is O(n^3 r + n^2 r^2), which is smaller than O(n^3 r + n^2 m r) when r ≪ m, i.e., in the setting with a large number of nonterminals whose probability tensor is of low rank. In this paper we adapt this low-rank neural parameterization to the LCFRS case to scale to a large number of nonterminals.
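To make the rank-space recursion concrete, the following is a minimal sketch of the corresponding inside algorithm for a CPD-factorized binary PCFG. It assumes the pre-computed matrices J and K, rank-space scores for each word position, and a rank-space start-symbol vector are given; all names (inside_rank_space, word_ranks, root_rank, etc.) are illustrative rather than taken from any released implementation.

```python
import torch

def inside_rank_space(word_ranks, J, K, root_rank):
    """Rank-space inside algorithm for a CPD-factorized binary PCFG (sketch).

    word_ranks: (n, r) rank-space scores for the word at each position (assumed given).
    J, K:       (r, r) pre-computed matrices J = V U^T and K = W U^T.
    root_rank:  (r,)   rank-space scores for the start symbol (assumed given).
    Returns the sentence likelihood (partition function).
    """
    n, r = word_ranks.shape
    beta = [[None] * (n + 1) for _ in range(n + 1)]   # rank-space inside scores
    left = [[None] * (n + 1) for _ in range(n + 1)]   # J @ beta, computed once per span
    right = [[None] * (n + 1) for _ in range(n + 1)]  # K @ beta, computed once per span

    def set_span(i, j, score):
        beta[i][j] = score
        left[i][j] = J @ score    # O(r^2) per span -> O(n^2 r^2) in total
        right[i][j] = K @ score

    for i in range(n):
        set_span(i, i + 1, word_ranks[i])
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            total = torch.zeros(r)
            for k in range(i + 1, j):          # O(r) per split -> O(n^3 r) in total
                total = total + left[i][k] * right[k][j]
            set_span(i, j, total)
    return torch.dot(root_rank, beta[0][n])
```

Projecting each span's score through J and K exactly once is what yields the O(n^3 r + n^2 r^2) complexity quoted above.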

Restricted LCFRS
In an LCFRS, a single nonterminal node can dominate a tuple of strings that need not be adjacent in the yield. The tuple size is referred to as the fan-out; we mark the fan-out of each non-leaf node in Fig. 1. The fan-out of an LCFRS is defined as the maximal fan-out among all its nonterminals, and it influences both expressiveness and parsing complexity. For a binary LCFRS (an LCFRS whose derivation rules have at most two nonterminals on the right-hand side) with fan-out k, the parsing complexity is O(n^{3k}). A binary CFG is thus a special case of a binary LCFRS with fan-out one, and parsing in that case reduces to the classic CKY algorithm. In this paper we work with binary LCFRS with fan-out two (LCFRS-2; Stanojević and Steedman, 2020), which is expressive enough to model discontinuous constituents but still efficient enough to enable practical grammar induction from natural language data. This choice is also motivated by Maier et al. (2012), who observe that restricting the fan-out to two suffices for capturing a large proportion of discontinuous constituents in real-world treebanks. However, LCFRS-2's inference complexity of O(n^6 |G|) is still too expensive for practical unsupervised learning. We follow Corro (2020) and discard all rules that require O(n^6) time to parse, which reduces the parsing complexity to O(n^5 |G|).

Formally, this restricted LCFRS-2 is a 6-tuple G = (S, N_1, N_2, P, Σ, R) where: S is the start symbol; N_1 and N_2 are finite sets of nonterminal symbols of fan-out one and two, respectively; P is a finite set of preterminal symbols; Σ is a finite set of terminal symbols; and R is a set of rules of the forms referred to as 1a, 1b, and 2a-2e below (where M = N_1 ∪ P denotes the set of symbols that may appear as fan-out-1 children). In this notation, each nonterminal is annotated with lowercase letters that stand for strings: A(x) indicates that A has fan-out one, and A(x, y) indicates that A has fan-out two, where x and y are non-adjacent contiguous strings in the yield of A. The juxtaposition xy denotes the concatenation of the adjacent strings x and y into a single string.
Illustrative Example. As an example of how this LCFRS can model discontinuous spans, consider the rule A(xy, z) → B(x) C(y, z). B is a fan-out-1 node whose yield is x = w_i ⋯ w_{k−1}, and C is a fan-out-2 node whose first span is y = w_k ⋯ w_{j−1} and whose second span is z = w_m ⋯ w_{n−1}. A is the parent node of B and C, and inherits their yields: x is concatenated with y to form a contiguous span, while z remains a standalone span.
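To see concretely where the O(n^5) item count comes from, the sketch below spells out the chart update for a rule of this shape: the parent item is indexed by four endpoints (i, j, m, n), and the split point k contributes a fifth free index, while the contraction over symbols contributes the |G| factor. The tensor names, shapes, and dense loop structure are purely illustrative, not the paper's implementation.

```python
import torch

def combine_fanout1_fanout2(beta1, beta2, rule_scores, sent_len):
    """One chart update for a rule of the form A(xy, z) -> B(x) C(y, z) (sketch).

    beta1:       (n+1, n+1, m)             inside scores of fan-out-1 items B over (i, k)
    beta2:       (n+1, n+1, n+1, n+1, m2)  inside scores of fan-out-2 items C over (k, j), (m, n)
    rule_scores: (m2, m, m2)               scores of A -> B C for this rule type
    Returns the contribution to fan-out-2 items A over spans (i, j), (m, n).
    """
    n = sent_len
    out = torch.zeros_like(beta2)
    for i in range(n):
        for k in range(i + 1, n):              # B's yield is x = w_i ... w_{k-1}
            for j in range(k + 1, n + 1):      # C's first span is y = w_k ... w_{j-1}
                for m_ in range(j + 1, n):     # gap between j and m_, so A keeps fan-out 2
                    for n_ in range(m_ + 1, n + 1):  # C's second span is z = w_{m_} ... w_{n_-1}
                        # A's first span is xy over (i, j); its second span is z over (m_, n_).
                        contrib = torch.einsum(
                            "b,c,abc->a", beta1[i, k], beta2[k, j, m_, n_], rule_scores
                        )
                        out[i, j, m_, n_] += contrib
    return out
```

The five nested index loops are the O(n^5) part; rules whose right-hand sides would require a sixth free endpoint are exactly the discarded O(n^6) rules.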
Parsing. Table 1 gives the parsing-as-deduction (Pereira and Warren, 1983) description of the CKY-style chart parsing algorithm for our restricted LCFRS-2.

Tensor decomposition-based neural parameterization
We now describe a parameterization of LCFRS-2 that combines a neural parameterization with tensor decomposition, which makes it possible to scale LCFRS-2 to thousands of nonterminals. Let m_1 = |N_1|, m_2 = |N_2|, p = |P|, and m = m_1 + p. The rules with A ∈ N_1 on the left-hand side are 1a and 2a, whose probabilities can be represented by 3D tensors C_1 ∈ R^{m_1×m×m} and D_1 ∈ R^{m_1×m×m_2}. For A ∈ N_2, the relevant rules are 1b, 2b, 2c, 2d, and 2e, whose probabilities can be represented by a 3D tensor C_2 ∈ R^{m_2×m×m} and tensors D_3, D_4, D_5, D_6 ∈ R^{m_2×m×m_2}. We stack D_3, D_4, D_5, D_6 into a single 4D tensor D_2 ∈ R^{m_2×m×m_2×4} to leverage the structural similarity of these rules. Since these tensors represent probabilities, the entries sharing a left-hand-side symbol must sum to one: for each A ∈ N_1 the corresponding entries of C_1 and D_1 sum to one (Eq. 1), and for each A ∈ N_2 the corresponding entries of C_2 and D_2 sum to one (Eq. 2).

Tensor decomposition. To scale up the LCFRS-2 to a large number of nonterminals, we first apply CPD to all the binary rule probability tensors, decomposing each into factor matrices as described above (where U_{:,q} denotes the q-th column of U). The resulting factor matrices have dimensions such as U_1 ∈ R^{m_1×r_1} (the remaining factors are defined analogously for each tensor). Here r_1, r_2, r_3, r_4 are the ranks of the tensors, which control inference complexity.
To ensure that these factorizations lead to valid probability tensors, we additionally impose the following restrictions: (1) all decomposed matrices are non-negative; and (2) they are normalized so that, for each left-hand-side symbol, the total probability mass over its rules sums to one, analogous to the row/column normalization of U, V, W described above. It is easy to verify that Eqs. 1 and 2 are satisfied if these requirements hold.
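As a quick sanity check of why such constraints suffice, the toy example below uses the PCFG-style factorization from the background section: non-negative factor matrices, normalized along the appropriate axes, automatically yield a rule tensor that is a valid conditional distribution for every parent symbol. The orientation of the matrices and the choice of normalization axes here are assumptions made for illustration.

```python
import torch

r, m = 4, 8  # toy tensor rank and number of symbols (illustrative values)

# Non-negative factor matrices (rows indexed by rank component q, columns by symbols),
# with the normalization assumed here:
#   U: each column sums to 1  (a distribution over rank components for each parent A)
#   V, W: each row sums to 1  (a distribution over child symbols for each rank component)
U = torch.softmax(torch.randn(r, m), dim=0)
V = torch.softmax(torch.randn(r, m), dim=1)
W = torch.softmax(torch.randn(r, m), dim=1)

# Reconstruct the rule tensor T[A, B, C] = sum_q U[q, A] * V[q, B] * W[q, C].
T = torch.einsum("qa,qb,qc->abc", U, V, W)

# For every parent symbol A: sum_{B,C} T[A, B, C] = sum_q U[q, A] = 1,
# so T is automatically a valid conditional probability tensor.
print(T.sum(dim=(1, 2)))  # ~ a vector of ones
```

The same argument carries over to the LCFRS-2 tensors, with the parent-side and child-side factors normalized analogously.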
For unsupervised learning, we need to compute the marginal likelihood of a sentence, p(w_1 w_2 ⋯ w_n). We give the rank-space dynamic program (i.e., the inside algorithm) for computing p(w_1 w_2 ⋯ w_n) in this tensor decomposition-based LCFRS-2 in Appdx. A. The resulting complexity is dominated by O(n^5 r_4 + n^4 (r_3 + r_4)(r_2 + r_4)). We thus set r_4 to a very small value (e.g., 4), which greatly reduces the total amount of training and evaluation time.
Parameterization. Following prior work on neural parameterizations of grammars (Jiang et al., 2016;Kim et al., 2019), we parameterize the component matrices to be the output of neural networks over shared embeddings.
The symbol embeddings are given by: E_1 ∈ R^{m×d}, where the first m_1 rows correspond to fan-out-1 nonterminal embeddings and the last p rows are the preterminal embeddings; E_2 ∈ R^{m_2×d}, the fan-out-2 nonterminal embedding matrix; and r ∈ R^d, the start symbol embedding. We also have four sets of "rank embeddings" R_1 ∈ R^{r_1×d}, R_2 ∈ R^{r_2×d}, R_3 ∈ R^{r_3×d}, and R_4 ∈ R^{r_4×d}. Given these, the entries of the U, V, W matrices are computed from the symbol and rank embeddings via MLPs and are normalized according to the requirements described in the previous subsection. We share the parameters of MLP pairs that play similar roles (e.g., f^1_V and f^3_V are both applied to left children). For the D_2 tensor we also require the matrix P ∈ R^{4×r_4}, which is given by P = f_P(R_4), where f_P is a one-layer residual network with output size 4 that is normalized via a softmax across the last dimension.
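As an illustration of this embedding-based parameterization, the sketch below produces one factor matrix from shared symbol embeddings and rank embeddings; the MLP depth, the use of a dot product between transformed symbol embeddings and rank embeddings, and the softmax axis are assumptions made for the sake of a concrete example.

```python
import torch
import torch.nn as nn

class ResidualMLP(nn.Module):
    """A small residual block (layer sizes here are illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.net(x)

d, m, r1 = 512, 64, 400                 # embedding size, number of symbols, tensor rank (toy values)

E1 = nn.Parameter(torch.randn(m, d))    # shared symbol embeddings (nonterminals + preterminals)
R1 = nn.Parameter(torch.randn(r1, d))   # rank embeddings
f_U = ResidualMLP(d)                    # MLP applied to the symbol embeddings

scores = f_U(E1) @ R1.t()               # (m, r1) unnormalized scores
U1 = torch.softmax(scores, dim=-1)      # each symbol's row becomes a distribution over rank dims
```

Because every factor matrix is computed from the same symbol embeddings, symbols that behave similarly end up with correlated rule distributions, which is the parameter-sharing effect examined in the analysis section.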
Finally, for the start and terminal distributions, we apply residual MLPs f_s and f_Q to the start symbol embedding and to the preterminal embeddings E_{1,m_1:} (the last p rows of E_1), respectively, which results in s ∈ R^{m_1} (i.e., the probability vector for rules of the form S → A) and Q ∈ R^{p×v} (i.e., the probability matrix for rules of the form T(w) → w). A softmax is applied in the last layer of f_s and f_Q to ensure that s and Q are valid probabilities.
Decoding. While the rank-space inside algorithm enables efficient computation of sentence likelihoods, direct CKY-style argmax decoding in this grammar requires instantiating the full probability tensors and is thus computationally intractable. We follow Yang et al. (2021b) and use minimum Bayes risk (MBR) decoding (Goodman, 1996). This involves obtaining the posterior probability of each span being a constituent via the inside-outside algorithm (which has the same asymptotic complexity as the inside algorithm), and then using these probabilities as input to CKY. The complexity of CKY with these posterior probabilities is independent of the number of nonterminals in the original grammar and is thus O(n^5). This approach can be seen as finding the tree with the largest expected number of constituents (Smith and Eisner, 2006). See Appd. A for more details.

Empirical Study
Data. We conduct experiments with our Tensor decomposition-based Neural LCFRS (TN-LCFRS) on German and Dutch, where discontinuous phenomena are more common than in English. For German we concatenate TIGER (Brants et al., 2001) and NEGRA (Skut et al., 1997) as our training set, while for Dutch we use the LASSY Small Corpus treebank (van Noord et al., 2013). The data splits can be found in Appd. B.1. For preprocessing we use disco-dop (van Cranenburgh et al., 2016; https://github.com/andreasvc/disco-dop) and discard all punctuation marks. We further take the most frequent 10,000 words for each language as the vocabulary, similar to the standard setup in unsupervised constituency parsing (Shen et al., 2018, 2019; Kim et al., 2019).
Hyperparameters. We choose |P| ∈ {45, 450, 4500} and scale the number of fan-out-one and fan-out-two nonterminals along with |P|. Varying the grammar size this way allows us to test the effects of latent variable overparameterization, which has previously been shown to be helpful for structure induction (Buhai et al., 2020). The ranks of the probability tensors are set to r_1 = r_3 = 400 and r_2 = r_4 = 4, and the dimensionality of the embedding space is d = 512. Model parameters are initialized with Xavier uniform initialization. More training details and hyperparameters can be found in Appd. B.3.

Baselines. Our baselines include: the neural PCFG (N-PCFG) and the compound PCFG (C-PCFG) (Kim et al., 2019), which cannot directly predict discontinuous constituents but still serve as strong baselines for overall F1 since the majority of spans in these treebanks are continuous; and their direct extensions, the neural LCFRS (N-LCFRS) and the compound LCFRS (C-LCFRS), which do not employ the tensor-based low-rank factorization. These non-low-rank models have high computational complexity and hence we set |P| = 45 for them. When |P| = 4500, we also compare TN-LCFRS with TN-PCFG (Yang et al., 2021b).
Evaluation. We use unlabeled corpus-level F1 to evaluate unsupervised parsing performance. We report both overall F1 and discontinuous F1 (DF1). For all experiments, we report the mean and standard deviation over four independent runs with different random seeds. See Appd. B.2 for details.

Main results

Table 2 shows the main results. With smaller grammars (|P| = 45), we find that both the neural and compound LCFRSs have lower F1 than their PCFG counterparts, despite being able to predict discontinuous constituent spans. On the other hand, TN-LCFRS achieves better F1 than N-LCFRS even though it is a more restricted model (since it assumes that the rule probability tensors are of low rank), showing the benefits of parameter sharing through low-rank factorizations. As we scale up TN-LCFRS with |P| ∈ {45, 450, 4500}, we observe continuous improvements in performance, with TN-LCFRS 4500 achieving the best F1 and DF1 on all three datasets. These results all outperform trivial (left-branching, right-branching, and random tree) baselines.

As an upper bound we also train a supervised model with TN-LCFRS 4500. For supervised training we use the optimal binarization from Gildea (2010) to binarize the treebanks, remove all trees that are unrecognizable by our restricted LCFRS, and maximize the joint probability of observed sentences and their corresponding unlabeled binarized trees by marginalizing over latent nonterminal symbols. We also show the maximum possible performance with oracle binary trees under this optimal binarization. While the discontinuous F1 scores of our unsupervised parsers are nontrivial, there is still a large gap between the unsupervised and supervised scores (and also between the supervised and oracle scores), indicating opportunities for further work in this area.

Analysis
Recall by constituent label. Table 3 shows the recall by constituent tag for the different models.
Overall, the unsupervised methods do well on noun phrases (NP), prepositional phrases (PP), and proper nouns (PN), with some of the models approaching the supervised baselines. Verb phrases (VP) and adjective phrases (AP) remain challenging. Table 4 gives recall by label for discontinuous constituents only, where we observe that most discontinuous constituents are VPs. In Appd. C, we also show F1/DF1 broken down by sentence length.
Approximation error. Approximation error in the context of unsupervised learning arises due to the mismatch between the EM objective (i.e., log marginal likelihood) and structure recovery (i.e., F1), and is related to model misspecification (Liang and Klein, 2008). Figure 2 (left) shows training/dev perplexity as well as dev F1/DF1 as a function of the number of epochs. We find that larger grammars result in better performance in terms of both perplexity and structure recovery, which ostensibly indicates that the unsupervised objective is positively correlated with structure induction performance. However, when we first perform supervised learning on the log joint likelihood and then switch to unsupervised learning with the log marginal likelihood (Figure 2, right), we find that while perplexity improves after we switch to the unsupervised objective, structure induction performance deteriorates. Still, the difference in F1 before and after switching to the unsupervised objective is smaller for larger models, confirming the benefits of using a large number of nonterminals/preterminals.

Even more restricted LCFRS formalisms.
There are even more restricted versions of LCFRS which allow faster parsing (e.g., O(n^3) or O(n^4)) but can still model discontinuous constituents. In the supervised case, these restricted variants have been shown to perform almost as well as the more expressive O(n^5) and O(n^6) variants (Corro, 2020). In the unsupervised case, however, we observe in Table 5 that disallowing the O(n^5) rules (2b, 2c, 2d, 2e) significantly degrades discontinuous F1. We posit that this phenomenon is related to the empirical benefits of latent variable overparameterization: while in theory it is possible to model most discontinuous phenomena with more restricted rules, making the generative model more expressive by "overparameterizing" in rule space empirically leads to better performance.
Parameter sharing. As shown in Table 5, it was important to share the symbol embeddings across the different rules. Sharing the parameters of the MLPs as described in Sec. 2.3 was also found to be helpful. This highlights the benefits of working with neural parameterizations of grammars, which make such parameter sharing straightforward.

Example trees. Fig. 3 shows two predicted German trees. In the first sentence, the crossing dependency occurs because the initial adverb ("So") is analyzed as a dependent of the non-finite verb phrase at the end of the sentence, a configuration that arises from German V2 word order. Our parser correctly predicts this dependency, although the subject NP (which is itself correctly identified) has the wrong internal structure. The second sentence highlights a case of partial success with right-extraposed relative clauses. While our model correctly predicts the top-level discontinuous constituent "[Für 15 200 Mark]−[Lampen einbauen lassen die mutwilligen Zerstörungen standhalten]", the parser does not adopt a discontinuous-constituency analysis of the right-extraposed relative clause itself ("[Lampen]−[die mutwilligen Zerstörungen standhalten]"). Instead it makes the relative clause a part of the non-finite verb complex, which does not conform to the annotation guidelines but is nonetheless linguistically plausible. Sentence-initial adverbs in the context of auxiliary verb constructions and right-extraposed relative clauses are two common instances of discontinuous phenomena in German. Wh-questions constitute another potential class of discontinuous phenomena; however, these are not treated as discontinuous in TIGER/NEGRA. See Appd. D for more examples (including on Dutch).

Discussion and Limitations
We tried our approach on the discontinuous version of the English Penn Treebank (DPTB; Evang and Kallmeyer, 2011) but failed to induce any meaningful discontinuous structures, possibly because discontinuous phenomena are much less common in English than in German and Dutch. For example, while 5.67% of the gold constituents in NEGRA are discontinuous, only 1.84% of the gold constituents in DPTB are (Corro, 2020).
The neural LCFRS was also quite sensitive to hyperparameters and parameterization. The instability of unsupervised structure induction is widely acknowledged and could potentially be mitigated by a large amount of training data, as suggested by Liang and Klein (2008) and Pate and Johnson (2016). Due to this sensitivity, we rely on dev sets for some modeling choices (e.g., the ranks of the probability tensors). Hence, our approach is arguably not fully unsupervised in the strictest sense of the term, although this is a common setup in unsupervised parsing due to the mismatch between the unsupervised learning objective and structure recovery. Finally, while we observed significant increases in performance as we scaled up the number of nonterminals, we also observed diminishing returns. Further scaling up the grammar is thus unlikely to close the (large) gap that still exists between the unsupervised and supervised parsing results.

Conclusion
This work studied unsupervised discontinuous constituency parsing with mildly context-sensitive grammars. By using a tensor decomposition-based neural parameterization of linear context-free rewriting systems, our approach was able to induce grammars with nontrivial discontinuous parsing performance on German and Dutch. Whether even more expressive grammars will eventually lead to models that are competitive with pure neural language models (as language models) remains an open question.

A Fast LCFRS Inference with CPD

Yang et al. (2022) propose a family of CPD-based algorithms for fast inference in B-FGGs, which combine B-graphs (Klein and Manning, 2001) and factor graph grammars (FGG; Chiang and Riley, 2020). Inference in LCFRS is subsumed by B-FGG because, for each rule, the number of variables on the left-hand side is always one. As such, we can adopt the method of Yang et al. (2022) to perform fast dynamic programming inference in "rank space" for our restricted LCFRS-2.
The base cases initialize the chart with the preterminal scores from Q, where Q_{:,x_i} is the x_i-th column of Q. The recursive DP computations then combine these rank-space scores span by span, using matrices that can be pre-computed before inference. The partition function Z (i.e., the sentence likelihood) is finally obtained from the scores of the full-span items, using R_1 = s^⊤ U_1 and R_2 = s^⊤ U_2.
MBR decoding. MBR decoding aims to find the parse with the maximum expected number of constituent spans, which can be decomposed into two steps: (i) span marginal estimation, and (ii) CKY-style parsing with the marginals. Denote the continuous and discontinuous span marginals as X ∈ R^{N×N} and Y ∈ R^{N×N×N×N}, with Σ_{ij} X_{ij} + Σ_{ijmn} Y_{ijmn} = 2n − 1. Span marginals can be estimated via the inside-outside algorithm or, equivalently, by backpropagation through the inside algorithm (Eisner, 2016, Sec. 6.2). The second-stage CKY-style parsing is similar to the description in Table 1, except that the grammar rule probabilities are replaced with span marginals, as described in Table 6. The total time complexity is dominated by the first stage of marginal estimation, whose complexity is the same as that of the inside algorithm (Eisner, 2016).
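The sketch below illustrates the backpropagation trick on a deliberately tiny model (a single-symbol binary CFG rather than the full LCFRS-2): multiplying every span's inside score by a dummy "gate" fixed at 1 and differentiating log Z with respect to the gates yields exactly the posterior span marginals. All names are illustrative.

```python
import torch

def toy_span_marginals(rule_score, word_scores):
    """Posterior span marginals via autograd on a single-symbol inside algorithm (sketch).

    rule_score:  positive scalar tensor, score of the single binary rule.
    word_scores: (n,) positive tensor of per-word scores.
    Returns an (n+1, n+1) matrix whose (i, j) entry is the posterior probability
    that (i, j) is a constituent span under this toy model.
    """
    n = word_scores.shape[0]
    gate = torch.ones(n + 1, n + 1, requires_grad=True)  # one gate per potential span
    beta = {}
    for i in range(n):
        beta[(i, i + 1)] = word_scores[i] * gate[i, i + 1]
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            split_sum = sum(beta[(i, k)] * beta[(k, j)] for k in range(i + 1, j))
            beta[(i, j)] = rule_score * split_sum * gate[i, j]
    log_z = beta[(0, n)].log()
    log_z.backward()                 # d log Z / d gate[i, j], evaluated at gate = 1
    return gate.grad                 # = posterior probability of span (i, j)

# Example usage with arbitrary positive scores:
marginals = toy_span_marginals(torch.tensor(0.5), torch.rand(6) + 0.1)
```

In the actual model, the same derivative is taken through the restricted LCFRS-2 inside algorithm, producing both the continuous marginals X and the discontinuous marginals Y.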

B.1 Data split
For German, we follow Corro (2020) and use the NEGRA treebank (Skut et al., 1997) with the split proposed by Dubey and Keller (2003), and the TIGER treebank (Brants et al., 2001) with the split provided by the SPMRL 2014 shared task (Seddah et al., 2014). For Dutch, there is no standard split in the discontinuous parsing literature. We follow UD-Dutch-Alpino (Bouma and van Noord, 2017) and use a hybrid training dataset that comprises the whole Alpino treebank (van der Beek et al., 2001) and a subset of the LASSY Small Corpus (van Noord et al., 2013). We further use the whole WR-P-P-H and WR-P-P-L sections as the development and test sets, respectively.

B.2 Evaluation metric details
Following standard practice in unsupervised parsing evaluation, we ignore all trivial continuous spans, i.e., whole-sentence spans and single-word spans. In addition, we ignore all discontinuous spans of fan-out greater than two. Finally, we evaluate only on sentences of length up to 40 due to computational considerations.
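For concreteness, the following is a minimal sketch of the corpus-level unlabeled F1 (and DF1) computation under these conventions, where each constituent is represented as the set of token indices it dominates, so that discontinuous constituents are simply non-contiguous index sets. The function names, the representation, and the exact filtering details are illustrative assumptions rather than the evaluation script itself.

```python
def fan_out(span):
    """Number of maximal contiguous blocks in a set of token indices."""
    idx = sorted(span)
    return 1 + sum(1 for a, b in zip(idx, idx[1:]) if b != a + 1)

def corpus_f1(gold_trees, pred_trees, sent_lens, discontinuous_only=False, max_len=40):
    """Unlabeled corpus-level F1 (DF1 when discontinuous_only=True); a sketch.

    Each tree is a collection of constituents, each given as a frozenset of token indices.
    """
    tp = n_gold = n_pred = 0
    for gold, pred, n in zip(gold_trees, pred_trees, sent_lens):
        if n > max_len:                            # evaluate only on sentences up to length 40
            continue
        def keep(span):
            if len(span) <= 1 or len(span) == n:   # drop trivial continuous spans
                return False
            if fan_out(span) > 2:                  # drop spans of fan-out greater than two
                return False
            return fan_out(span) == 2 if discontinuous_only else True
        g = {s for s in gold if keep(s)}
        p = {s for s in pred if keep(s)}
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```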

B.3 Training details
For training, we use a curriculum training strategy (Bengio et al., 2009): in the first epoch we train only on sentences of length up to 30, and we then increase the maximum length by five each epoch until we reach the maximum sentence length (60 for Dutch and 40 for German). We use the Adam optimizer (Kingma and Ba, 2015) with β_1 = 0.75, β_2 = 0.999, learning rate 0.002, batch size 20, and a maximum gradient norm of 3. We train for 20 epochs and perform early stopping based on development set performance with a maximum patience of 5.

C Additional results

Table 7 shows the maximum performance across the four seeds, while Table 8 gives F1 broken down by sentence length on TIGER.

D Additional example trees
We show additional example trees for German in Fig. 4 and for Dutch in Fig. 5.