Neural Bi-Lexicalized PCFG Induction

Neural lexicalized PCFGs (L-PCFGs) have been shown to be effective in grammar induction. However, to reduce computational complexity, they make a strong independence assumption on the generation of the child word, and thus bilexical dependencies are ignored. In this paper, we propose an approach to parameterize L-PCFGs without making implausible independence assumptions. Our approach directly models bilexical dependencies while reducing both the learning and representation complexities of L-PCFGs. Experimental results on the English WSJ dataset confirm the effectiveness of our approach in improving both running speed and unsupervised parsing performance.


Introduction
Probabilistic context-free grammars (PCFGs) have been an important probabilistic approach to syntactic analysis (Lari and Young, 1990; Jelinek et al., 1992). They assign a probability to each of the parses admitted by a CFG and rank them by plausibility, so that the ambiguity of the CFG can be ameliorated. Still, due to the strong independence assumptions of CFGs, vanilla PCFGs (Charniak, 1996) are far from adequate for highly ambiguous text.
A common premise for tackling the issue is to incorporate lexical information and weaken the independence assumptions. Many approaches have been proposed under this premise (Magerman, 1995; Collins, 1997; Johnson, 1998; Klein and Manning, 2003). Among them, lexicalized PCFGs (L-PCFGs) are a relatively straightforward formalism (Collins, 2003). L-PCFGs extend PCFGs by associating a word, i.e., the lexical head, with each grammar symbol. They can thus exploit lexical information to disambiguate parsing decisions and are much more expressive than vanilla PCFGs. However, they suffer from representation and inference complexities. For representation, the addition of lexical information greatly increases the number of parameters to be estimated and exacerbates the data sparsity problem during learning, so expectation-maximization (EM) based estimation of L-PCFGs has to rely on sophisticated smoothing techniques and factorizations (Collins, 2003). As for inference, the CYK algorithm for L-PCFGs has an O(l^5 |G|) complexity, where l is the sentence length and |G| is the grammar constant. Although Eisner and Satta (1999) manage to reduce the complexity to O(l^4 |G|), inference with L-PCFGs is still relatively slow, making them less popular nowadays.
Recently, Zhu et al. (2020) combine the ideas of factorizing the binary rule probabilities (Collins, 2003) and neural parameterization (Kim et al., 2019) and propose neural L-PCFGs (NL-PCFGs), achieving good results in both unsupervised dependency and constituency parsing. Neural parameterization is the key to this success: it facilitates informed smoothing (Kim et al., 2019), reduces the number of learnable parameters for large grammars (Chiu and Rush, 2020; Yang et al., 2021), and enables advanced gradient-based optimization techniques in place of the traditional EM algorithm (Eisner, 2016). However, Zhu et al. (2020) oversimplify the binary rules to decrease the complexity of the inside/CYK algorithm in learning (i.e., estimating the marginal sentence log-likelihood) and inference. Specifically, they make a strong independence assumption on the generation of the child word such that it depends only on the nonterminal symbol. Bilexical dependencies, which have been shown to be useful in unsupervised dependency parsing (Han et al., 2017; Yang et al., 2020), are thus ignored.
To model bilexical dependencies while reducing complexities, we draw inspiration from the canonical polyadic decomposition (CPD) (Kolda and Bader, 2009) and propose a latent-variable based neural parameterization of L-PCFGs. Cohen et al. (2013) and Yang et al. (2021) have used CPD to decrease the complexities of PCFGs, and our work can be seen as an extension of theirs to L-PCFGs. We further adopt the unfold-refold transformation technique (Eisner and Blatz, 2007) to decrease complexities. Using this technique, we show that the time complexity of the inside algorithm implemented by Zhu et al. (2020) can be improved from cubic to quadratic in the number of nonterminals m; the inside algorithm of our proposed method has a linear complexity in m after combining CPD and unfold-refold. We evaluate our model on the benchmark Wall Street Journal (WSJ) dataset. Our model surpasses the strong baseline NL-PCFG (Zhu et al., 2020) by 2.9% mean F1 and 1.3% mean UUAS under CYK decoding; with Minimum Bayes-Risk (MBR) decoding, it performs even better. We provide an efficient implementation of our proposed model at https://github.com/sustcsonglin/TN-PCFG.

Lexicalized CFGs
We first introduce the formalization of CFGs. A CFG is defined as a 5-tuple G = (S, N, P, Σ, R), where S is the start symbol, N is a finite set of nonterminal symbols, P is a finite set of preterminal symbols, Σ is a finite set of terminal symbols, and R is a set of rules; N, P, and Σ are mutually disjoint. (An alternative definition of CFGs does not distinguish nonterminals N (constituent labels) from preterminals P (part-of-speech tags) and treats both as nonterminals.) We will use 'nonterminals' to denote N ∪ P when it is clear from the context. Lexicalized CFGs (L-CFGs) (Collins, 2003) extend CFGs by associating a word with each of the nonterminals, where w_p, w_q ∈ Σ are the headwords of the constituents spanned by the associated grammar symbols, and p, q are the word positions in the sentence. We refer to A, a parent nonterminal annotated with the headword w_p, as the head-parent. In binary rules, we refer to a child nonterminal as the head-child if it inherits the headword of the head-parent (e.g., B[w_p]) and as the non-head-child otherwise (e.g., C[w_q]). A head-child appears as either the left child or the right child. We denote the head direction by D ∈ {←, →}, where ← means the head-child appears as the left child.
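As an informal sketch of this notation (the representation and names are our own, not from any particular implementation), a lexicalized binary rule can be viewed as a triple of symbols plus a head direction, with headwords attached when the rule is applied:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LexRule:
    parent: str      # head-parent symbol A
    left: str        # left-child symbol
    right: str       # right-child symbol
    head_dir: str    # "<-" if the left child is the head-child, "->" otherwise

def expand(rule, head_word, other_word):
    """Attach headwords: the head-child inherits the parent's headword
    head_word; the non-head-child takes other_word."""
    if rule.head_dir == "<-":    # head-child is the left child
        return (f"{rule.left}[{head_word}]", f"{rule.right}[{other_word}]")
    return (f"{rule.left}[{other_word}]", f"{rule.right}[{head_word}]")

r = LexRule("S", "NP", "VP", "->")    # the VP inherits the headword
print(expand(r, "sleeps", "dogs"))    # ('NP[dogs]', 'VP[sleeps]')
```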

Grammar induction with lexicalized probabilistic CFGs
Lexicalized probabilistic CFGs (L-PCFGs) extend L-CFGs by assigning each production rule r = A → γ a scalar π_r such that the scalars form a valid categorical probability distribution given the left-hand side A. Note that preterminal rules always have a probability of 1 because they define a deterministic generating process. Grammar induction with L-PCFGs proceeds in the same way as grammar induction with PCFGs. As with PCFGs, we maximize the log-likelihood of each observed sentence w = w_1, ..., w_l, where p(t) = ∏_{r∈t} π_r and T_{G_L}(w) consists of all possible lexicalized parse trees of the sentence w under an L-PCFG G_L. We can compute the marginal p(w) of the sentence by using the inside algorithm in polynomial time. The core recursion of the inside algorithm is formalized in Equation 3. It recursively computes the probability s^{A,p}_{i,j} of a head-parent A[w_p] spanning the substring w_i, ..., w_{j-1} (p ∈ [i, j-1]). Terms A1 and A2 in Equation 3 cover the cases of the head-child being the left child and the right child, respectively.
[Figure 1: Bayesian networks of the binary rule probability and its factorizations. In the factorization of Zhu et al. (2020), W_q is independent of B, D, A, and W_p given C; (c) shows our proposed parameterization. We slightly abuse the Bayesian network notation by grouping variables: in the standard notation, there would be arcs from the parent variables to each grouped variable as well as arcs between the grouped variables.]
The number of word-annotated nonterminals is up to |Σ| times the number of nonterminals in PCFGs. As the grammar size is largely determined by the number of binary rules and increases approximately cubically in the number of nonterminals, representing L-PCFGs has a high space complexity O(m^3 |Σ|^2), where m is the number of nonterminals. Specifically, it requires an order-6 probability tensor for binary rules, with the dimensions representing A, B, C, w_p, w_q, and the head direction D, respectively. With so many rules, L-PCFGs are very prone to the data sparsity problem in rule probability estimation. Collins (2003) suggests factorizing the binary rule probabilities according to specific independence assumptions, but his approach still relies on complicated smoothing techniques to be effective. The addition of lexical heads also scales up the computational complexity of the inside algorithm by a factor of O(l^2), bringing it to O(l^5 m^3). Eisner and Satta (1999) point out that, by changing the order of summations in Term A1 (A2) of Equation 3, one can cache and reuse Term B1 (B2) in Equation 4 and reduce the computational complexity to O(l^4 m^2 + l^3 m^3). This is an example application of unfold-refold, as noted by Eisner and Blatz (2007). However, the complexity is still cubic in m, making it expensive to increase the total number of nonterminals.
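The O(l^5 m^3) bound can be made concrete by counting loop iterations. The sketch below is illustrative only (it covers just the left-headed case and counts iterations rather than computing probabilities): it enumerates spans (i, j), split points k, headword positions p and q, and the m^3 choices of symbols A, B, and C.

```python
def naive_inside_iterations(l, m):
    """Count the iterations of the naive L-PCFG inside recursion
    (left-headed case): O(l^5) span/position tuples times m^3 symbol
    choices. Purely a complexity illustration, not the real algorithm."""
    count = 0
    for span in range(2, l + 1):              # span length
        for i in range(l - span + 1):         # span start
            j = i + span                      # span end (exclusive)
            for k in range(i + 1, j):         # split point
                for p in range(i, k):         # headword position, left child
                    for q in range(k, j):     # headword position, right child
                        count += m ** 3       # choices of A, B, C
    return count

# Doubling the number of nonterminals multiplies the work by 8 (cubic in m).
assert naive_inside_iterations(3, 2) == 8 * naive_inside_iterations(3, 1)
```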

Neural L-PCFGs
Zhu et al. (2020) apply neural parameterization to tackle the data sparsity issue and to reduce the total number of learnable parameters of L-PCFGs. Considering the case in which the head-child is the left child (the other case is similar), they further factorize the binary rule probability as shown in Equation 2. Bayesian networks representing the original probability and the factorization are illustrated in Figure 1, and the inside recursion can be rewritten as Equation 5. Zhu et al. (2020) implement the inside algorithm by caching Term C1-1 in Equation 6, resulting in a time complexity of O(l^4 m^3 + l^3 m), which is cubic in m. We note that we can use unfold-refold to further cache Term C1-2 in Equation 6 and reduce the time complexity of the inside algorithm to O(l^4 m^2 + l^3 m + l^2 m^2), which is quadratic in m.
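The caching idea behind these complexity reductions can be illustrated with a toy tensor contraction. Under a factorization of the form p(B, C, D, w_q | A, w_p) = p(B, C, D | A, w_p) · p(w_q | C), the inner sum over the non-head-child's symbol and headword position can be precomputed once and reused for every parent/head-child pair. The tensors below are random toy scores, not an actual grammar:

```python
import numpy as np

rng = np.random.default_rng(0)
m, nq = 6, 5                     # nonterminals, candidate headword positions
P = rng.random((m, m, m))        # toy scores over (A, B, C)
Pw = rng.random((m, nq))         # toy p(w_q | C) scores
inside = rng.random((m, nq))     # toy inside scores of the non-head-child

# Naive: for every (A, B, C), sum over q -- O(m^3 * nq) work.
naive = np.einsum("abc,cq,cq->ab", P, Pw, inside)

# Unfold-refold style: cache t[C] = sum_q p(w_q | C) * inside[C, q] once
# -- O(m * nq) -- then contract with P -- O(m^3). Same result, less work.
t = (Pw * inside).sum(axis=-1)
cached = np.einsum("abc,c->ab", P, t)

assert np.allclose(naive, cached)
```

Reordering the summation does not change the value, only the amount of repeated work, which is exactly why the inside algorithm's asymptotic complexity drops.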
Although the factorization of Equation 2 reduces the space and time complexities of the inside algorithm for L-PCFGs, it is based on the independence assumption that the generation of w_q is independent of A, B, D, and w_p given the non-head-child C. This assumption is violated in many scenarios and hence reduces the expressiveness of the grammar. For example, suppose C is Noun; then even if we know B is Verb, we still need to know D to determine whether w_q is an object or a subject of the verb, and then need to know the actual verb w_p to pick a likely noun as w_q.

Factorization with latent variable
Our main goal is to find a parameterization that removes the implausible independence assumptions of Zhu et al. (2020) while decreasing the complexities of the original L-PCFGs.
To reduce the representation complexity, we draw inspiration from the canonical polyadic decomposition (CPD). CPD factorizes an n-th order tensor into n two-dimensional matrices. Each matrix has two dimensions: one comes from the original n-th order tensor and the other is shared by all n matrices. The shared dimension can be marginalized to recover the original n-th order tensor; from a probabilistic perspective, it can be regarded as a latent variable. In the spirit of CPD, we introduce a latent variable H to decompose the order-6 probability tensor p(B, C, D, w_q | A, w_p). Instead of fully decomposing the tensor, we empirically find that binding some of the variables leads to better results. Our best factorization is illustrated as a Bayesian network in Figure 1(c). According to d-separation (Pearl, 1988), when A and w_p are given, B, C, w_q, and D are interdependent due to the existence of H. In other words, our factorization does not make any independence assumption beyond the original binary rule. The domain size of H is analogous to the tensor rank in CPD and thus influences the expressiveness of our proposed model. Based on this factorization, the binary rule probability is decomposed accordingly; we also follow Zhu et al. (2020) in factorizing the start rule.
Choices of factorization: If we followed the intuition of CPD strictly, we would assume that B, C, D, and w_q are all independent conditioned on H. However, properly relaxing this strong assumption by binding some variables can benefit our model. Although there are many different choices for binding the variables, some can be easily ruled out. For instance, binding B and C prevents us from caching Term D1-1 and Term D1-2 in Equation 7, so we cannot implement the inside algorithm efficiently; binding C and w_q leads to a high computational complexity because we would have to compute a high-dimensional (m|Σ|) categorical distribution. In Section 6.3, we conduct an ablation study on the impact of different choices of factorization.
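A minimal numpy sketch of the latent-variable factorization with D bound to C (toy dimensions; the conditional distributions below are random rather than learned): composing the factors and marginalizing H yields a full order-6 conditional probability tensor that is properly normalized, mirroring how CPD's shared dimension acts as a latent variable.

```python
import numpy as np

rng = np.random.default_rng(0)
mA, mB, mC, nW, nH = 3, 4, 4, 5, 6   # toy sizes; D has 2 directions

def norm(x, axes):
    """Normalize random scores into conditional distributions over `axes`."""
    return x / x.sum(axis=axes, keepdims=True)

pH  = norm(rng.random((mA, nW, nH)), -1)        # p(H | A, w_p)
pB  = norm(rng.random((nH, mB)), -1)            # p(B | H)
pCD = norm(rng.random((nH, mC, 2)), (-2, -1))   # p(C, D | H), D bound with C
pWq = norm(rng.random((nH, nW)), -1)            # p(w_q | H)

# Marginalize H to recover the order-6 conditional probability tensor
# p(B, C, D, w_q | A, w_p).
joint = np.einsum("awh,hb,hcd,hq->awbcdq", pH, pB, pCD, pWq)

# It sums to 1 over (B, C, D, w_q) for every (A, w_p), as required.
assert np.allclose(joint.sum(axis=(2, 3, 4, 5)), 1.0)
```

Note that the full tensor is only materialized here for illustration; the point of the factorization is precisely that the inside algorithm never needs to build it.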
Neural parameterizations: We follow Kim et al. (2019) and Zhu et al. (2020) in defining the neural parameterization (see Appendix A).

Experimental setup

Dataset
We conduct experiments on the Wall Street Journal (WSJ) corpus of the Penn Treebank (Marcus et al., 1994). We use the same preprocessing pipeline as Kim et al. (2019). Specifically, punctuation is removed from all data splits and the top 10,000 most frequent words in the training data are used as the vocabulary. For dependency grammar induction, we follow Zhu et al. (2020) and use the Stanford typed dependency representation (de Marneffe and Manning, 2008).

Hyperparameters
We optimize our model using the Adam optimizer with β_1 = 0.75, β_2 = 0.999, and a learning rate of 0.001. All parameters are initialized with Xavier uniform initialization. We set the dimension of all embeddings to 256 and the ratio of the number of nonterminals to the number of preterminals to 1:2. Our best model uses 15 nonterminals, 30 preterminals, and d_H = 300. We use grid search to tune the number of nonterminals (from 5 to 30) and the domain size d_H of the latent variable H (from 50 to 500).

Evaluation
We run each model four times with different random seeds for ten epochs each. We train our models on training sentences of length ≤ 40 with batch size 8 and test them on the whole test set. For each run, we perform early stopping and select the best model according to perplexity on the development set. We use two different parsing methods: the variant of the CYK algorithm of Eisner and Satta (1999) and Minimum Bayes-Risk (MBR) decoding (Smith and Eisner, 2006). For constituency grammar induction, we report the means and standard deviations of sentence-level F1 scores. For dependency grammar induction, we report the unlabeled directed attachment score (UDAS) and the unlabeled undirected attachment score (UUAS).
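MBR decoding for constituency trees can be sketched as follows: given posterior span probabilities mu[i][j] (obtained from the inside-outside algorithm in practice; here they would be supplied by the caller), a CKY-style dynamic program selects the binary tree that maximizes the expected number of correct spans. This is a generic sketch of the idea, not the paper's implementation:

```python
def mbr_tree(mu, l):
    """Return the binary tree over [0, l) maximizing the sum of span
    posteriors mu[i][j]. Trees are nested tuples of (i, j) spans."""
    best = [[0.0] * (l + 1) for _ in range(l + 1)]
    split = [[None] * (l + 1) for _ in range(l + 1)]
    for span in range(2, l + 1):
        for i in range(l - span + 1):
            j = i + span
            # Best split point maximizes the two subtrees' accumulated scores.
            k = max(range(i + 1, j), key=lambda k: best[i][k] + best[k][j])
            split[i][j] = k
            best[i][j] = mu[i][j] + best[i][k] + best[k][j]

    def backtrack(i, j):
        if j - i <= 1:
            return (i, j)
        k = split[i][j]
        return ((i, j), backtrack(i, k), backtrack(k, j))

    return backtrack(0, l)

# Toy marginals for a 3-word sentence that strongly favor the span (1, 3).
mu = [[0.0] * 4 for _ in range(4)]
mu[0][3], mu[1][3], mu[0][2] = 1.0, 1.0, 0.1
print(mbr_tree(mu, 3))   # ((0, 3), (0, 1), ((1, 3), (1, 2), (2, 3)))
```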

Main result
We present our main results in Table 2. Our model is referred to as the Neural Bi-Lexicalized PCFG (NBL-PCFG). We mainly compare our approach against recent PCFG-based models: the neural PCFG (N-PCFG) and compound PCFG (C-PCFG) (Kim et al., 2019), the tensor decomposition based neural PCFG (TN-PCFG) (Yang et al., 2021), and the neural L-PCFG (NL-PCFG) (Zhu et al., 2020). We report both the official results of Zhu et al. (2020) and those of our reimplementation. We do not use the compound trick (Kim et al., 2019) in our implementations of lexicalized PCFGs because we empirically find that it results in unstable training and does not necessarily bring performance improvements.
We make three key observations: (1) Our model achieves the best F1 and UUAS scores under both CYK and MBR decoding and is comparable to the official NL-PCFG in the UDAS score. (2) When we remove the compound parameterization from NL-PCFG, its F1 score drops slightly while its UDAS and UUAS scores drop dramatically, which implies that the compound parameterization is key to NL-PCFG's strong dependency grammar induction performance. (3) MBR decoding outperforms CYK decoding.
Regarding UDAS, our model significantly outperforms NL-PCFGs when compound parameterization is not used (37.1 vs. 23.8 with CYK decoding), showing that explicitly modeling bilexical relationships is helpful in dependency grammar induction. However, when compound parameterization is used, the UDAS of NL-PCFGs improves greatly, slightly surpassing that of our model. We believe this is because compound parameterization greatly weakens the independence assumption of NL-PCFGs (i.e., that the child word depends only on C) by leaking bilexical information through the global sentence embedding. NBL-PCFGs, on the other hand, are already expressive enough, so compound parameterization brings no further increase in expressiveness but makes learning more difficult.

Analysis
In the following experiments, we report results using MBR decoding by default. We also use d_H = 300 by default unless otherwise specified.

Influence of the domain size of H

Figure 2a illustrates perplexities and F1 scores as d_H increases. A small d_H results in worse perplexity and F1 scores, reflecting the limited expressiveness of NBL-PCFGs. When d_H is larger than 300, the perplexity plateaus and the F1 score starts to decrease, possibly because of overfitting.

Influence of nonterminal number
Figure 2b illustrates perplexities and F1 scores as the number of nonterminals increases, with d_H fixed at 300 (plots of UDAS and UUAS can be found in the Appendix). We observe that increasing the number of nonterminals has only a minor influence on NBL-PCFGs. We speculate that this is because the number of word-annotated nonterminals (m|Σ|) is already sufficiently large even when m is small. On the other hand, the number of nonterminals has a large influence on NL-PCFGs. This is most likely because NL-PCFGs make the independence assumption that the generation of w_q is solely determined by the non-head-child C and thus require more nonterminals so that C has the capacity to convey information from A, B, D, and w_p. Using more nonterminals (> 30) seems to be helpful for NL-PCFGs, but would be computationally too expensive due to the quadratic complexity in the number of nonterminals.

Influence of different variable bindings

Table 3 presents the results of our models with the following bindings:
• D-alone: D is generated alone.
• D-w_q: D is generated with w_q.
• D-B: D is generated with the head-child B.
• D-C: D is generated with the non-head-child C.
Clearly, binding D and C (the default setting for NBL-PCFG) results in the lowest perplexity and the highest F1 score. Binding D and w_q yields surprisingly good performance in unsupervised dependency parsing.
We find that how the head direction is bound has a large impact on unsupervised parsing performance, for which we offer the following intuition. Given a headword and its type, the children generated in each direction are usually different, so D is intuitively more related to w_q and C than to B; B, in contrast, depends more on the headword. In Table 3 we can see that D-B has a lower UDAS than D-C and D-w_q, which is consistent with this intuition. Notably, in Zhu et al. (2020), Factorization III has a significantly lower UDAS than their default model (25.9 vs. 35.5), and the only difference between the two is whether the generation of C depends on the head direction. This is also consistent with our intuition.

Qualitative analysis
We analyze the parsing performance of different PCFG extensions by breaking down their recall numbers by constituent label (see Table 4). NPs and VPs cover most of the gold constituents in the WSJ test set. TN-PCFGs have the best performance in predicting NPs, and NBL-PCFGs have better performance in predicting the other labels on average. We further analyze the quality of our induced trees. Our model prefers to predict left-headed constituents (i.e., constituents headed by the leftmost word). VPs are usually left-headed in English, so our model has a much higher recall on VPs and correctly predicts their headwords. SBARs often start with which and that, and PPs often start with prepositions such as of and for. Our model often relies on these words to predict the correct constituents and hence erroneously predicts them as the headwords, which hurts the dependency accuracy. For NPs, we find that our model often makes mistakes on adjective-noun phrases. For example, the correct parse of a rough market is (a (rough market)), but our model predicts ((a rough) market) instead.

Discussion on dependency annotation schemes
What should be regarded as the headword is still debatable in linguistics, especially around function words (Zwicky, 1993). For example, in the phrase the company, some linguists argue that the should be the headword (Abney, 1987). These disagreements are reflected in dependency annotation schemes. Researchers have found that different dependency annotation schemes result in very different evaluation scores for unsupervised dependency parsing (Noji, 2016; Shen et al., 2020).
In our experiments, we use the Stanford Dependencies annotation scheme in order to compare with NL-PCFGs. Stanford Dependencies prefers to select content words as headwords. However, as discussed in the previous sections, our model prefers to select function words (e.g., of, which, for) as the headwords of SBARs and PPs. This explains why our model can outperform all the baselines on constituency parsing but not on dependency parsing (as judged by Stanford Dependencies) at the same time. Table 3 shows that there is a trade-off between the F1 score and UDAS, which suggests that adapting our model to Stanford Dependencies would hurt its ability to identify constituents.

Speed comparison
In practice, the forward and backward passes of the inside algorithm consume the majority of the running time when training an N(B)L-PCFG. The existing implementation by Zhu et al. (2020) does not employ efficient parallelization and has a cubic time complexity in the number of nonterminals. We provide an efficient reimplementation of the inside algorithm based on Equation 6 (we follow Zhang et al. (2020) in batchifying it). We refer to the implementation that caches Term C1-1 as re-impl-1 and to the implementation that caches Term C1-2 as re-impl-2. We measure the time of a single forward and backward pass of the inside algorithm with batch size 1 on a single Titan V GPU. Figure 3a illustrates the running time as the sentence length increases, with the number of nonterminals fixed at 10. The original implementation of NL-PCFG by Zhu et al. (2020) takes much more time on long sentences: when the sentence length is 40, it needs 6.80 s, while our fast implementation takes 0.43 s and our NBL-PCFG takes only 0.30 s. Figure 3b illustrates the running time as the number of nonterminals m increases, with the sentence length fixed at 30. The original implementation runs out of 12 GB of memory when m = 30. re-impl-2 is faster than re-impl-1 as m increases because it has a better time complexity in m (quadratic for re-impl-2, cubic for re-impl-1). Our NBL-PCFGs have a linear complexity in m and, as the figure shows, are much faster when m is large.

Related Work
Unsupervised parsing has a long history but has regained great attention in recent years. In unsupervised dependency parsing, most methods are based on the Dependency Model with Valence (DMV) (Klein and Manning, 2004). Neurally parameterized DMVs have obtained state-of-the-art performance (Jiang et al., 2016; Han et al., 2017, 2019; Yang et al., 2020). However, they rely on gold POS tags and sophisticated initializations (e.g., K&M initialization or initialization with the parsing results of another unsupervised model). A left-corner parsing-based DMV model that limits the stack depth of center-embedding is insensitive to initialization but needs gold POS tags. He et al. (2018) propose a latent-variable based DMV model, which does not need gold POS tags but requires good initialization and high-quality induced POS tags. We refer readers to existing surveys for a broader overview of unsupervised dependency parsing. Compared to these methods, our method requires neither gold/induced POS tags nor sophisticated initializations, though its performance lags behind some of these previous methods.
(3) Syntactic distance-based methods (Shen et al., 2018, 2019, 2020), which encode hidden syntactic trees into syntactic distances and inject them into language models. (4) Probing-based methods, which extract phrase-structure trees from the attention distributions of large pre-trained language models. In addition to these methods, Cao et al. (2020) use constituency tests, and Shi et al. (2021) make use of naturally occurring bracketings such as hyperlinks on webpages to train parsers. Multimodal information such as images (Shi et al., 2019; Zhao and Titov, 2020; Jin and Schuler, 2020) and videos (Zhang et al., 2021) has also been exploited for unsupervised constituency parsing.
We are aware of only a few previous studies on unsupervised joint dependency and constituency parsing. Klein and Manning (2004) propose a joint model of the DMV and the CCM (Klein and Manning, 2002). Shen et al. (2020) propose a transformer-based method in which syntactic distances are defined to guide the attention of transformers. Zhu et al. (2020) propose neural L-PCFGs for unsupervised joint parsing.

Conclusion
We have presented a new formalism of lexicalized PCFGs. Our formalism relies on canonical polyadic decomposition to factorize the probability tensor of binary rules. The factorization reduces the space and time complexities of lexicalized PCFGs while keeping the independence assumptions encoded in the original binary rules intact. We further parameterize our model using neural networks and present an efficient implementation. On the English WSJ test data, our model achieves the lowest perplexity, outperforms all existing extensions of PCFGs in constituency grammar induction, and is comparable to strong baselines in dependency grammar induction.

A Neural parameterization

h_i(x) = g_{i,1}(g_{i,2}(W_i x))
g_{i,j}(y) = ReLU(V_{i,j} ReLU(U_{i,j} y)) + y
f([x; y]) = h_4(ReLU(W[x; y]) + y)

B Influence of the domain size of H and the number of nonterminals

Figure 4 illustrates the change of UUAS and UDAS as d_H increases. We find tendencies similar to those of the F1 scores and perplexities: d_H = 300 performs best. Figure 5 illustrates the change of UUAS and UDAS as the number of nonterminals increases. We can see that NL-PCFGs benefit from using more nonterminals, while NBL-PCFGs perform better when the number of nonterminals is relatively small.
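The residual feed-forward functions h_i and g_{i,j} used in the neural parameterization can be sketched in numpy as follows (toy dimension and random weights; the actual model uses dimension 256 and learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # embedding dimension (toy size; the paper uses 256)

def relu(x):
    return np.maximum(x, 0.0)

def g(y, V, U):
    # g_{i,j}(y) = ReLU(V ReLU(U y)) + y   (residual MLP block)
    return relu(V @ relu(U @ y)) + y

def h(x, W, V1, U1, V2, U2):
    # h_i(x) = g_{i,1}(g_{i,2}(W_i x))   (two stacked residual blocks)
    return g(g(W @ x, V2, U2), V1, U1)

W, V1, U1, V2, U2 = (rng.standard_normal((d, d)) * 0.1 for _ in range(5))
out = h(rng.standard_normal(d), W, V1, U1, V2, U2)
assert out.shape == (d,)
```

The skip connection in g means that with zero weight matrices the block reduces to the identity, which is the usual motivation for residual parameterizations.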