Simple Hardware-Efficient PCFGs with Independent Left and Right Productions

Scaling dense PCFGs to thousands of nonterminals via a low-rank parameterization of the rule probability tensor has been shown to be beneficial for unsupervised parsing. However, PCFGs scaled this way still perform poorly as a language model, and even underperform similarly-sized HMMs. This work introduces \emph{SimplePCFG}, a simple PCFG formalism with independent left and right productions. Despite imposing a stronger independence assumption than the low-rank approach, we find that this formalism scales more effectively both as a language model and as an unsupervised parser. As an unsupervised parser, our simple PCFG obtains an average F1 of 65.1 on the English PTB, and as a language model, it obtains a perplexity of 119.0, outperforming similarly-sized low-rank PCFGs. We further introduce \emph{FlashInside}, a hardware IO-aware implementation of the inside algorithm for efficiently scaling simple PCFGs.


Introduction
Despite the improvements in unsupervised parsing obtained through scaling neural probabilistic context-free grammars (PCFGs), their language modeling performance scales less favorably compared to, for example, hidden Markov models (HMMs) and neural language models. On the Penn Treebank, a neural PCFG with 30 nonterminals and 60 preterminals obtains ≈ 250 perplexity (Kim et al., 2019), and while scaling neural PCFGs to thousands of states via a low-rank parameterization can improve perplexity to ≈ 170 (Yang et al., 2022), this still lags behind a similarly-sized HMM, which obtains ≈ 130 perplexity (Chiu et al., 2021), despite the fact that HMMs are a subclass of PCFGs. This work proposes SimplePCFG, a simple PCFG formalism with independent left and right productions. We find that this simple PCFG scales more effectively (in terms of both language modeling and unsupervised parsing) than previous approaches which scale PCFGs by factorizing the rule probability tensor into low-rank components (Yang et al., 2021b, 2022). In particular, we find that simple PCFGs can obtain significantly lower perplexity in language modeling while achieving higher unsupervised parsing performance compared to low-rank PCFGs with a similar number of nonterminals, achieving near state-of-the-art unsupervised parsing performance on the Penn Treebank with an F1 of 65.1. We further describe a hardware-efficient, IO-aware implementation of the inside algorithm, dubbed FlashInside, to facilitate scalable learning of simple PCFGs.

Simple PCFGs
A PCFG can be defined by a 6-tuple $G = (S, \mathcal{N}, \mathcal{P}, \Sigma, \mathcal{R}, \pi)$, where $S$ is the distinguished start symbol, $\mathcal{N}/\mathcal{P}/\Sigma$ are finite sets of nonterminal/preterminal/terminal symbols, $\mathcal{R}$ is a set of production rules of the form $S \to A$, $A \to B\,C$, or $T \to w$, and $\pi : \mathcal{R} \to [0, 1]$ maps rules to their associated probabilities. In simple PCFGs, we decompose $\pi_{A \to BC}$ into $\pi_{B \curvearrowleft A} \cdot \pi_{A \curvearrowright C}$, effectively assuming that left and right children are generated independently. We denote by $\mathbf{L}, \mathbf{R} \in \mathbb{R}^{|\mathcal{N}| \times |\mathcal{N}|}$ the matrix representations of $\pi_{B \curvearrowleft A}$ and $\pi_{A \curvearrowright C}$, and apply a neural parameterization over these matrices to compute the rule probabilities (Kim et al., 2019). See Appendix A for details.
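As a concrete illustration, here is a minimal PyTorch sketch (the sizes and variable names are ours, not from the paper) of how binary-rule probabilities factorize under this independence assumption:

```python
import torch

N = 512  # number of nonterminals (hypothetical size)

# L[A, B] ~ pi_{B <- A}: probability that parent A generates left child B.
# R[A, C] ~ pi_{A -> C}: probability that parent A generates right child C.
# Each row is a separate distribution over children; in the neural
# parameterization these scores would come from a network over symbol
# embeddings rather than free parameters.
L = torch.randn(N, N).softmax(dim=-1)
R = torch.randn(N, N).softmax(dim=-1)

# The binary rule probability factorizes into independent left/right parts:
#   pi_{A -> B C} = pi_{B <- A} * pi_{A -> C}
A, B, C = 3, 10, 42
rule_prob = L[A, B] * R[A, C]

# Because each factor is separately normalized, summing over all (B, C)
# pairs yields 1 for every parent A, so the rules form a valid distribution.
assert torch.allclose(L.sum(-1) * R.sum(-1), torch.ones(N))
```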

Figure 1: Bayesian network-like representations of PCFG binary rules: (a) original grammar, (b) after tensor decomposition (Yang et al., 2021b), and (c) rank space grammar (Yang et al., 2022). Our simple PCFG is almost the same as (c) but uses a more flexible parameterization.
Comparing simple vs. low-rank PCFGs. The previous approach to scaling HMMs and PCFGs to thousands of nonterminals is to parameterize the rule probability tensor $\mathbf{T} \in \mathbb{R}^{|\mathcal{N}| \times |\mathcal{N}| \times |\mathcal{N}|}$ to be low-rank (Chiu et al., 2021; Yang et al., 2021b, 2022). Low-rank PCFGs can be viewed as introducing a new latent variable, namely a "rank variable" $R$, to decompose $\pi_{A \to BC}$ into $\sum_{R} \pi_{A \to R}\, \pi_{B \curvearrowleft R}\, \pi_{R \curvearrowright C}$, as shown in Fig. 1, where the tensor/matrix representations of $\pi_{A \to BC}$, $\pi_{A \to R}$, $\pi_{B \curvearrowleft R}$, $\pi_{R \curvearrowright C}$ are $\mathbf{T}$, $\mathbf{U}$, $\mathbf{V}$, $\mathbf{W}$, respectively. Yang et al. (2022, Sect. 4.2) show that a low-rank PCFG can be reparameterized as a simple PCFG with independent left/right productions by marginalizing out the nonterminal variables and viewing the rank variables as new nonterminal variables. As such, low-rank PCFGs parameterize $\mathbf{L}, \mathbf{R}$ in a more restrictive manner: $\mathbf{L} = \mathbf{V}\mathbf{U}^\top$, $\mathbf{R} = \mathbf{W}\mathbf{U}^\top$. We speculate that the shared factor $\mathbf{U}^\top$ restricts the expressiveness of low-rank PCFGs and thus hinders optimization, which motivates our simple PCFGs.
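To make the contrast concrete, the following sketch (our own construction, with hypothetical sizes) shows the shared factor $\mathbf{U}$ in the low-rank parameterization versus the unconstrained matrices of the simple PCFG:

```python
import torch

N, r = 512, 64  # nonterminal count and rank (hypothetical sizes)

# Low-rank PCFG: the left- and right-child matrices share the factor U
# (the parent-to-rank probabilities), coupling the two distributions.
U = torch.randn(N, r).softmax(-1)  # U[A, R] ~ pi_{A -> R}, normalized over R
V = torch.randn(N, r).softmax(0)   # V[B, R] ~ pi_{B <- R}, normalized over B
W = torch.randn(N, r).softmax(0)   # W[C, R] ~ pi_{R -> C}, normalized over C
L_lowrank = V @ U.T  # rank at most r, and tied to R_lowrank through U
R_lowrank = W @ U.T

# Simple PCFG: L and R are parameterized independently with no shared
# factor -- the extra flexibility we conjecture helps optimization.
L_simple = torch.randn(N, N).softmax(-1)
R_simple = torch.randn(N, N).softmax(-1)
```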
A Hardware-efficient Inside Algorithm

The inside algorithm for simple PCFGs
The inside algorithm for simple PCFGs has the following recursive formula:
$$\beta^{A}_{ij} = \sum_{i < k < j} \Big(\sum_{B} \pi_{B \curvearrowleft A}\, \beta^{B}_{ik}\Big) \Big(\sum_{C} \pi_{A \curvearrowright C}\, \beta^{C}_{kj}\Big)$$
Vector form. We abuse notation and write
$$\beta_{ij} = \sum_{i < k < j} (\mathbf{L} \beta_{ik}) \odot (\mathbf{R} \beta_{kj})$$
where $\odot$ is the element-wise product.
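A naive, probability-space sketch of this recursion (our own code; batching, the start-symbol step, and numerical stability are omitted here and addressed by FlashInside below):

```python
import torch

def inside(L, R, beta_term):
    """Inside algorithm for a simple PCFG, directly following
    beta_{ij} = sum_{i<k<j} (L beta_{ik}) * (R beta_{kj}).

    L, R: (N, N) left-/right-production probability matrices.
    beta_term: (n, N) inside scores of width-1 spans, i.e. the
               (pre)terminal emission probabilities at each position.
    """
    n, N = beta_term.shape
    beta = {(i, i + 1): beta_term[i] for i in range(n)}
    for w in range(2, n + 1):          # span width
        for i in range(n - w + 1):     # span start (parallelizable, see below)
            j = i + w
            acc = torch.zeros(N)
            for k in range(i + 1, j):  # split point
                acc += (L @ beta[i, k]) * (R @ beta[k, j])
            beta[i, j] = acc
    return beta[0, n]  # inside vector of the whole sentence
```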

FlashInside
It is necessary to implement the inside algorithm efficiently on GPUs to facilitate the scaling of simple PCFGs. We introduce FlashInside, a hardware-efficient, IO-aware implementation of the inside algorithm in the spirit of FlashAttention (Dao et al., 2022). FlashInside comprises four main techniques:

Span-level parallelism. Given the span width $w$, the inside probability vectors $\beta_{i(i+w)}$ can be computed in parallel across different starting positions $i$ (Yi et al., 2011, Sect. 4.2).
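Under the naive formulation above, span-level parallelism amounts to processing all start positions of a given width in one batched operation; a sketch (the chart layout and names are ours):

```python
import torch

def inside_step(chart, L, R, w):
    """Fill all width-w entries of the inside chart in parallel.

    chart: (n + 1, n + 1, N) tensor with chart[i, j] = beta_{ij}
           (width-1 entries assumed already filled).
    """
    n = chart.shape[0] - 1
    starts = torch.arange(n - w + 1)
    acc = torch.zeros(n - w + 1, chart.shape[-1])
    for k in range(1, w):                      # relative split offset
        left = chart[starts, starts + k]       # (num_spans, N)
        right = chart[starts + k, starts + w]  # (num_spans, N)
        acc += (left @ L.T) * (right @ R.T)    # batched over all spans
    chart[starts, starts + w] = acc
```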
The log-einsum-exp trick. To improve numerical stability, it is common to use the "log-sum-exp" trick. For example, letting $a_{ik} = \log(\mathbf{L}\beta_{ik})$, $b_{kj} = \log(\mathbf{R}\beta_{kj})$, and $x^\star = \max_{k}(a_{ik} + b_{kj})$ (element-wise), the log-space recursion is
$$o_{ij} = \log \sum_{k} \exp(a_{ik} + b_{kj}) = x^\star + \log \sum_{k} \exp(a_{ik} + b_{kj} - x^\star). \quad (1)$$
Using log-sum-exp could be expensive when computing $a_{ij}$ and $b_{ij}$, so we resort to the "log-einsum-exp" trick (Peharz et al., 2020, Sect. 3.2): with $\gamma_{ij} = \log \beta_{ij}$ and $c = \max(\gamma_{ij})$, we compute $a_{ij} = \log(\mathbf{L} \exp(\gamma_{ij} - c)) + c$, and analogously for $b_{ij}$. This allows us to leverage matrix multiplication operators, which are highly optimized on GPUs, to compute $a_{ij}$ and $b_{ij}$.

Kernel fusion. The above computation involves many element-wise operations and is thus memory-bound. Loading and storing these intermediate vectors multiple times would incur significant IO cost (Dao et al., 2022). We reduce the IO cost by fusing these operations whenever possible: concretely, each matrix multiplication is followed by fused element-wise log and addition operations.
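A sketch of the log-einsum-exp primitive as we understand it (the function name is ours): factoring out the max keeps exp in a safe range while the reduction itself runs as a single dense matmul:

```python
import torch

def log_mm_exp(M, log_v):
    """Stably compute log(M @ exp(log_v)) for a nonnegative matrix M
    and a vector of log-space scores log_v.

    The max c is factored out so that exp() cannot overflow, and the
    inner reduction becomes a plain matrix multiplication, which is far
    better optimized on GPUs than a generic log-sum-exp reduction.
    """
    c = log_v.max()
    return torch.log(M @ torch.exp(log_v - c)) + c

# E.g., with gamma_ij = torch.log(beta_ij):
#   a_ij = log_mm_exp(L, gamma_ij)   # = log(L @ beta_ij)
#   b_ij = log_mm_exp(R, gamma_ij)   # = log(R @ beta_ij)
```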
Recomputation. While it is possible to rely on automatic differentiation (AD) to backpropagate through the inside algorithm (Eisner, 2016), this can be memory-inefficient since AD saves all the intermediate results of the DP computation, most of which are not needed. For example, in Eq. 1 the partial derivative of $o_{ij}$ with respect to $a_{ik}$ (and likewise $b_{kj}$) is given by $\exp(a_{ik} + b_{kj} - o_{ij})$. In the backward pass, we can recompute $\exp(a_{ik} + b_{kj} - o_{ij})$ without needing to store $\exp(a_{ik} + b_{kj} - x^\star)$ in the forward pass, thus saving memory. We found that this manual backpropagation led to a slight decrease in running speed but greatly reduced memory usage, and thus use it for all our experiments.
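A sketch of the recomputation idea using a custom autograd function (ours, not the paper's actual kernels): the forward pass stores only $a$, $b$, and $o$, and the weights $\exp(a + b - o)$ are rebuilt during the backward pass instead of being saved:

```python
import torch

class LogSumExpOverSplits(torch.autograd.Function):
    """o = log sum_k exp(a_k + b_k), with recomputation in backward."""

    @staticmethod
    def forward(ctx, a, b):
        # a, b: (K, N) log-space left/right scores over K split points.
        o = torch.logsumexp(a + b, dim=0)  # (N,)
        # Save only inputs and output; the (K, N) intermediate
        # exp(a + b - x*) from the forward pass is NOT stored.
        ctx.save_for_backward(a, b, o)
        return o

    @staticmethod
    def backward(ctx, grad_o):
        a, b, o = ctx.saved_tensors
        # Recompute the normalized weights exp(a + b - o) on the fly;
        # this is exactly the partial derivative of o w.r.t. a (and b).
        g = grad_o * torch.exp(a + b - o)  # broadcasts over K
        return g, g
```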

Speed comparison
As shown in Table 1, the log-einsum-exp trick significantly accelerates the inside algorithm and reduces its memory footprint. FlashInside additionally employs kernel fusion and recomputation, resulting in further improvements, especially on larger grammars and longer sentences.
Table 2 shows the language modeling performance on PTB. SN-PCFG obtains significantly lower perplexity than the rank PCFG and outperforms similarly-sized HMMs. This indicates that simple PCFGs provide a viable path towards scaling PCFGs, despite the strict independence assumption. Tables 3 and 4 show the unsupervised parsing performance. SN-PCFG consistently outperforms the rank PCFG in S-F1 while obtaining much lower perplexity. We also experiment with the compound version of simple PCFGs (SC-PCFG), which uses an auxiliary sentence-level vector to model sentence-level properties and uses variational inference for learning (see the appendix for the full parameterization). We find that SN-PCFG performs better on English, while SC-PCFG achieves the best parsing performance in languages other than English. We remark that the compound parameterization has been reported to be incompatible with the low-rank parameterization, probably due to optimization issues (Yang et al., 2021b). This work successfully scales compound PCFGs to thousands of states, which could be useful in settings such as multimodal grammar induction that condition on vector representations of side information (Zhao and Titov, 2020; Jin and Schuler, 2020; Zhang et al., 2021, 2022; Li et al., 2022).
Simple PCFG vs. Neural PCFG. Despite the better scalability of simple PCFGs, we find that with the same number of nonterminals (i.e., 128), SN-PCFG expectedly underperforms N-PCFG in both language modeling and unsupervised parsing (Table 4) due to the stronger independence assumption that is necessary for scaling. Nevertheless, N-PCFG does not scale well and (for example) runs into memory issues even with just 256 nonterminals, while SN-PCFG can scale to 8192 nonterminals on a single A40 GPU.
Simple PCFG vs. Rank PCFG. Recall that the rank PCFG and the simple PCFG share an identical dynamic programming structure: the rank variable in the rank PCFG amounts to the nonterminal variable in the simple PCFG. Consequently, if we align the rank size of the rank PCFG with the nonterminal size of the simple PCFG, we achieve parity in memory footprint and computational speed within the dynamic programming computation. In our experiments, we opt for a rank size of 4096 in the low-rank PCFG. The results, presented in Tables 2-4, show the worse performance of the rank PCFG when compared to SN-PCFG with 4096 nonterminals. Interestingly, this work was motivated by our observation that merely increasing the rank size of a PCFG falls short of bridging the performance gap between HMMs and PCFGs in language modeling. This led us to explore alternative parameterizations, culminating in the straightforward parameterization based on independent left/right productions, which yields superior results in both language modeling and unsupervised parsing.

Related Work
Independence assumptions are frequently made in grammar learning for tractability and scalability. Simple PCFGs assume independent generation of left and right children, thus resembling split-head dependency grammars (Eisner, 1996; Collins, 1997; Eisner and Satta, 1999; Paskin, 2001; Klein and Manning, 2004). We have shown that trading the expressiveness of the grammar formalism for scalability is beneficial, and this idea could be applied to other complex grammar formalisms with high parsing complexity, such as mildly context-sensitive grammars (Yang et al., 2023), synchronous grammars (Kim, 2021; Wang et al., 2022; Friedman et al., 2022; Lou and Tu, 2023), and lexicalized grammars (Zhu et al., 2020; Yang et al., 2021a).

Conclusion
In this work we explore a simpler variant of PCFGs (SimplePCFG) that shows better scaling properties than previous approaches in terms of both language modeling and unsupervised parsing performance.We also introduce a hardware-aware version of the inside algorithm (FlashInside) which improves over existing vectorized GPU implementations.

Limitations
We have successfully bridged the gap between HMMs and PCFGs in language modeling. However, a significant disparity remains between PCFGs and neural models such as Transformers. While we recognize the potential of our hardware-efficient implementation of the inside algorithm for conducting large-scale language modeling experiments, our aim is not to position PCFGs as direct rivals to neural models, given the intrinsic limitations arising from the PCFG's strong context-free independence assumptions. Our main objective is to enhance unsupervised PCFG learning, with a central focus on optimizing the sentence log marginal likelihood objective.
Simple PCFGs, due to their restrictive nature, require many nonterminals to perform well. However, we observe diminishing returns when scaling up simple PCFGs. This phenomenon is common when scaling up latent-variable models, and future work might consider leveraging the techniques of Liu et al. (2023) to mitigate this issue.
When scaling up simple PCFGs, the computation of grammar rule probabilities can also be expensive, especially when constructing the emission probability matrix of size $\mathbb{R}^{|\mathcal{P}| \times |V|}$. The compound parameterization exacerbates this issue since each sentence has its own set of grammar rule probabilities. Consequently, we only used up to 2048 nonterminals in our SC-PCFG experiments.

Table 1: Speed and memory footprint measured on a single NVIDIA A40 GPU, where we compare against the standard log-sum-exp implementation of the inside algorithm, which only leverages span-level parallelism (e.g., in Torch-Struct (Rush, 2020)).

Table 2: Results on the PTB language modeling split from Mikolov et al. (2011). NT denotes the number of nonterminals and ppl denotes perplexity. Top results are from previous papers (Chiu et al., 2021; Yang et al., 2022), while the bottom results are from the current work. Our runs are averaged over 4 seeds.

Table 3: Results on the Chinese, French, and German treebanks. All runs are averaged over 4 seeds.