Approximating CKY with Transformers

We investigate the ability of transformer models to approximate the CKY algorithm, using them to directly predict a parse and thus avoid the CKY algorithm's cubic dependence on sentence length. We find that on standard constituency parsing benchmarks this approach achieves competitive or better performance than comparable parsers that make use of CKY, while being faster. We also evaluate the viability of this approach for parsing under random PCFGs. Here we find that performance declines as the grammar becomes more ambiguous, suggesting that the transformer is not fully capturing the CKY computation. However, we also find that incorporating additional inductive bias is helpful, and we propose a novel approach that makes use of gradients with respect to chart representations in predicting the parse, in analogy with the CKY algorithm being the subgradient of a partition function variant with respect to the chart.


Introduction
Parsers based on transformers (Vaswani et al., 2017) currently represent the state of the art in constituency parsing (Mrini et al., 2020; Tian et al., 2020), and recent work (Tenney et al., 2019; Jawahar et al., 2019; Li et al., 2020; Murty et al., 2022; Eisape et al., 2022; Zhao et al., 2023) has found that transformers are capable of learning constituent-like representations of spans of text. Given these successes, it is natural to wonder whether transformers capture the algorithmic processes we associate with constituency parsing, such as the CKY algorithm (Kasami, 1966; Younger, 1967; Baker, 1979). Indeed, one might suspect that the layers of a transformer are building up phrase-level representations, much as the CKY algorithm itself builds up its chart. Such a hypothesis has become particularly compelling in light of recent work studying the ability of transformers, and graph neural networks more generally, to implement or approximate classical algorithms (Xu et al., 2019; Csordás et al., 2021; Dudzik and Veličković, 2022; Delétang et al., 2022, inter alia).
If standard transformers were indeed approximating CKY, there would be several implications. Practically, such a finding might lead to faster neural parsers. Whereas state-of-the-art parsers (e.g., that of Kitaev and Klein (2018) and follow-up work (Kitaev et al., 2019; Tian et al., 2020)) tend to implement CKY on top of transformer representations, thus incurring a parsing time complexity that is cubic in the sentence length, we could conceivably extract a parse from simply running a transformer over the sentence; this would involve a time complexity that is only quadratic in the sentence length. Moreover, since there is significant academic and industrial effort aimed at making standard transformers faster (e.g., that of Dao et al. (2022)), this progress could conceivably transfer automatically to the parsing case.
In addition to more practical considerations, the task of producing CKY parses with transformers provides an excellent opportunity for investigating whether endowing transformers with additional inductive bias can help them in implementing classical algorithms, a topic recently studied by Csordás et al. (2021) and others. The results of such an investigation would also bear on recent results relating to the computational power of transformers trained in the standard way (Delétang et al., 2022; Liu et al., 2023).
Accordingly, we first show that having a pretrained transformer simply predict an entire parse by independently labeling each span (an approach similar to that taken, in a different context and only at training time, by Corro (2020)) is sufficient to obtain competitive or better performance on standard constituency parsing benchmarks, while being significantly faster.
While these constituency parsing results are encouraging, they do not imply that trained transformers are implementing the CKY algorithm, because the transformer may simply be predicting a parse without it being highest-scoring under some grammar. We accordingly go on to investigate transformers' performance in predicting a CKY parse under randomly generated PCFGs. Given a randomly generated PCFG, it is of course easy to check whether a predicted parse is indeed highest-scoring. In this setting we find that the performance of transformers negatively correlates with the ambiguity of the PCFG, suggesting that they are not in fact implementing something CKY-like. At the same time, we find that incorporating additional inductive bias into the standard architecture is helpful, and we propose a novel approach, which makes use of gradients with respect to chart representations in predicting the parse. This inductive bias is inspired by the fact that the CKY algorithm can be viewed as computing the gradient of the "max score" partition function (Eisner, 2016; Rush, 2020), and we find that this improves performance on random PCFGs significantly.
In summary, we show that:
• using transformers to directly predict parses performs competitively with explicit CKY-based approaches, while being faster;
• this impressive performance is likely not due to transformers implicitly implementing the CKY algorithm, as they fail to accurately parse ambiguous synthetic PCFGs;
• biasing the model to produce parses from gradients with respect to the chart significantly improves synthetic PCFG parsing performance.

Background and Notation
The CKY algorithm (Kasami, 1966; Younger, 1967; Baker, 1979) computes a highest scoring parse of a sentence under a probabilistic context-free grammar (PCFG). Let G = (N, Σ, R, S, W) be a PCFG, with N the set of non-terminals, Σ the set of terminals, R the set of rules, S a start non-terminal, and W a set of probabilities per rule (which normalize over left-hand-sides). Given a sentence x ∈ Σ^T, the CKY algorithm uses a dynamic program to compute a highest scoring parse of x under G, and it requires O(|R|T^3) computation time.
Following the notation in Rush (2020), let ℓ_R ∈ R^{|R|×|N|×|N|} represent the log potentials (e.g., log probabilities) associated with the rules in PCFG G, and let ℓ_E(x) ∈ R^{T×|N|} represent the log potentials corresponding to each token in input sentence x. We can use these log potentials to compute the chart β ∈ R^{T×T×|N|} for x, where β[i, j, a] represents the sum (under a particular semiring) of all weight associated with the a-th non-terminal yielding x_{i:j}. These log potentials can also be used to compute a highest-scoring parse, which we will refer to as β* ∈ {0, 1}^{T×T×|N|}; this is the one-hot representation of a parse, with β*[i, j, a] = 1 iff j ≥ i and nonterminal a yields x_{i:j} in the parse.
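To make the chart computation above concrete, here is a minimal max-plus CKY sketch. It assumes dense log-potential arrays with rule_lp[a, b, c] = log P(a → b c) and term_lp[i, a] = log P(a → x_i), and that the start symbol S sits at index 0; these names, shapes, and conventions are illustrative rather than taken from any particular implementation (and the dense rule tensor makes the inner loop O(|N|^3) rather than O(|R|)).

```python
import numpy as np

def cky(rule_lp, term_lp):
    """Max-plus CKY: returns the best parse score and the one-hot chart beta*."""
    T, N = term_lp.shape
    score = np.full((T, T, N), -np.inf)   # score[i, j, a]: best log weight of a yielding x_{i:j}
    back = {}                             # backpointers: (i, j, a) -> (split k, children b, c)
    for i in range(T):
        score[i, i] = term_lp[i]
    for width in range(1, T):             # spans in order of increasing length
        for i in range(T - width):
            j = i + width
            for k in range(i, j):         # split into x_{i:k} and x_{k+1:j}
                # cand[a, b, c] = rule_lp[a, b, c] + score[i, k, b] + score[k+1, j, c]
                cand = rule_lp + score[i, k][None, :, None] + score[k + 1, j][None, None, :]
                flat = cand.reshape(N, -1)
                best_bc, best = flat.argmax(axis=1), flat.max(axis=1)
                for a in np.nonzero(best > score[i, j])[0]:
                    score[i, j, a] = best[a]
                    back[(i, j, a)] = (k, *divmod(int(best_bc[a]), N))

    def backtrack(i, j, a, beta_star):    # fill in the one-hot chart for the best parse
        beta_star[i, j, a] = 1.0
        if i < j:
            k, b, c = back[(i, j, a)]
            backtrack(i, k, b, beta_star)
            backtrack(k + 1, j, c, beta_star)
        return beta_star

    return score[0, T - 1, 0], backtrack(0, T - 1, 0, np.zeros((T, T, N)))
```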
CKY as a subgradient Recent work (Eisner, 2016; Rush, 2020) has emphasized that inference algorithms such as CKY may be fruitfully viewed as a special case of calculating the gradient or a subgradient of a generalized log partition function with respect to (some function of) the log potentials associated with an input. In particular, the CKY algorithm can be viewed as computing a subgradient of the "max score" variant of the log partition function with respect to an input sentence x's chart; see Appendix A for details. Thus, β* is relatively easily obtained with automatic differentiation frameworks. However, computing it in this way is exactly equivalent to running the traditional CKY algorithm, and so still requires O(|R|T^3) time.
Transformer-based CKY approximations The remainder of the paper focuses on training a model (i.e., an inference network (Kingma and Welling, 2014; Johnson et al., 2016; Tu and Gimpel, 2018)) to predict β* directly, in the hope of parsing more efficiently. Rather than use ℓ_R and ℓ_E(x) to recursively compute β* as the traditional CKY algorithm does, or compute β* by differentiating with respect to β, we instead propose to simply train a transformer to map a sentence to a correct β* for it. The hope is that the trained transformer will learn to represent information about the grammar implicitly in its weights, and that it can then be used without log potentials to parse unseen sentences from the same grammar. This approach is illustrated in schematic form in Figure 1.
We seek to evaluate our trained inference network's approximation to β* on held-out sentences that are from the same distribution (i.e., sampled from the same PCFG) as the training sentences. This contrasts with the vast majority of work on approximating classical algorithms with neural networks (Graves et al., 2014; Kaiser and Sutskever, 2016), which instead evaluates performance on out-of-distribution, and in particular longer, inputs than those on which the model was trained. We do not focus on length generalization, first because we believe generalizing even in-domain is useful in parsing applications, and second because we find transformers trained on random PCFGs struggle to generalize even to inputs of the same length.
CKY variants Modern span-based neural constituency parsers do not assume an underlying PCFG. Rather, these parsers score spans compositionally (Stern et al., 2017; Gaddy et al., 2018; Kitaev and Klein, 2018) using log potentials ℓ_S ∈ R^{T×T×|N|} produced by a neural network, such as a transformer. Such parsers use a CKY variant that requires O(T^3) computation time. We focus on using transformers to approximate both the classical CKY algorithm and this simpler variant.

Predicting Parses
As described in Section 2, we define β* ∈ {0, 1}^{T×T×|N|} to be a chart-sized tensor representing a CKY parse of a sentence x, with β*[i, j, a] = 1 if and only if nonterminal a yields x_{i:j}. We will use a very simple approach to predict β* with a transformer.
Let h_i ∈ R^d be an encoder-only transformer's final-layer representation of the i-th token in x, and let h_{i,:d/2} and h_{i,d/2:}, both in R^{d/2}, be (respectively) the first and last d/2 elements in h_i. We then score every span (i, j) with

    β̂[i, j] = FFN([h_{i,:d/2} ; h_{j,d/2:}]),    (1)

where above we have concatenated the first half of h_i and the second half of h_j, and where FFN is a BERT-like (Devlin et al., 2019) classification head comprising a single hidden layer, a GELU nonlinearity (Hendrycks and Gimpel, 2016), LayerNorm (Ba et al., 2016), and a final linear projection to |N| + 1 scores. This classification head produces a score for each nonterminal label as well as for a non-constituent label. This representation is similar to (but distinct from) that computed by Kitaev and Klein (2018) before using CKY on top of the computed scores. We set the lower triangle of β̂ (i.e., along the first two dimensions) to zero, and use it as our approximation of β*.
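As a hedged sketch, the span classifier in Equation (1) might look as follows in PyTorch; the module names and hyperparameters (e.g., the hidden size) are illustrative and not taken from the released implementation.

```python
import torch
import torch.nn as nn

class SpanScorer(nn.Module):
    """Scores every span (i, j) with |N| + 1 labels, as in Equation (1)."""
    def __init__(self, d, num_nonterminals, hidden=256):
        super().__init__()
        self.ffn = nn.Sequential(                       # BERT-style classification head
            nn.Linear(d, hidden),
            nn.GELU(),
            nn.LayerNorm(hidden),
            nn.Linear(hidden, num_nonterminals + 1),    # +1 for the non-constituent label
        )

    def forward(self, h):                               # h: (T, d) final-layer token vectors
        T, d = h.shape
        left = h[:, : d // 2]                           # h_{i,:d/2}, indexed by span start i
        right = h[:, d // 2 :]                          # h_{j,d/2:}, indexed by span end j
        pairs = torch.cat(                              # pairs[i, j] = [h_{i,:d/2} ; h_{j,d/2:}]
            [left.unsqueeze(1).expand(T, T, d // 2),
             right.unsqueeze(0).expand(T, T, d // 2)], dim=-1)
        scores = self.ffn(pairs)                        # (T, T, |N| + 1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=h.device))
        return scores * mask.unsqueeze(-1)              # zero the lower triangle (spans with j < i)
```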
Decoding Note that forming β̂ from the h_i is only quadratic in sentence length, and so if we can extract a parse from it in sub-cubic time we will improve (asymptotically) over CKY. In practice, we simply take the highest-scoring label for each span, and ignore spans for which the highest-scoring label is non-constituent. We then sort the predicted spans, first in decreasing order of end-token and then (stably) in increasing order of start-token, and build up the tree from left to right (a small sketch of this procedure appears below). We thus incur only an O(T^2(|N| + log_2 T)) decoding cost.

Training Given a gold (binarized) parse β*, we simply treat predicting the labels of each span as T(T + 1)/2 independent multi-class classification problems, and we use the standard cross-entropy loss. Note that Corro (2020) proposes this independent-span-classification training approach in the context of discontinuous constituency parsing (though with a fixed zero-weight for the non-constituent label). We note that concurrent work by Yang and Tu (2023) also explores both training and decoding by making span predictions independently.
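The following is a small sketch of the greedy decoding step referenced above. It assumes the (T, T, |N| + 1) score tensor from Equation (1) and a known index for the non-constituent label, and it omits the final step of assembling the ordered spans into a tree (and any handling of crossing spans).

```python
import torch

def greedy_spans(scores, non_constituent_idx):
    """Greedy decoding: best label per span, drop non-constituents, order spans for tree building."""
    T = scores.size(0)
    labels = scores.argmax(dim=-1)                       # (T, T) best label per span
    spans = [(i, j, int(labels[i, j]))
             for i in range(T) for j in range(i, T)
             if int(labels[i, j]) != non_constituent_idx]
    # sort in decreasing order of end token, then stably in increasing order of start token,
    # so that outer spans precede the inner spans they contain, left to right
    spans.sort(key=lambda s: s[1], reverse=True)
    spans.sort(key=lambda s: s[0])
    return spans
```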

Decoding Parses from Chart Gradients
As described in Section 2 (also see Appendix A), a CKY parse is a subgradient of the "max score" partition function, which calculates the maximum log (joint) probability under a PCFG of a sentence and its parse tree. If we are interested in making CKY-like predictions, then, it may be a useful inductive bias to form a predicted parse from the gradient of some scoring function, just as CKY does. We propose to use a transformer to define this scoring function, and to predict a parse from the gradients of this scoring function with respect to its inputs.
More concretely, again letting h_i be an encoder-only transformer's final-layer representation of the i-th token in sentence x, we define the following "inner" score of x:

    score(x) = FFN( (1/T) Σ_{i=1}^{T} h_i ),

where FFN is a BERT-like classification head producing only a single logit. Thus, we simply mean-pool over the transformer's final-layer token representations, and feed the result to a feed-forward network to obtain a scalar score.
Let h_i^{(l)} denote an encoder-only transformer's l-th layer representation of the i-th token. Since score is a differentiable function of all the h_i^{(l)}, we may take gradients with respect to them. In particular, let

    g_i = (1/L) Σ_{l=1}^{L} ∇_{h_i^{(l)}} score(x).

That is, g_i is the average of the gradients of score with respect to each transformer layer's representation of the i-th token. We can then form span predictions from the g_i, substituting them for the h_i in Equation (1), to obtain

    β̂[i, j] = FFN([g_{i,:d/2} ; g_{j,d/2:}]).    (2)

In the remainder of the paper, we refer to models making use of Equation (2) as "grad decoding." Training a grad decoding model requires backpropagating parameter gradients through a backpropagation with respect to the h_i^{(l)}. Fortunately, this is now simple to achieve with modern auto-differentiation frameworks, such as PyTorch (Paszke et al., 2019), which we use in all experiments.
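A hedged sketch of grad decoding follows. It assumes layer_states is the list of the L per-layer representations h^{(l)} (each of shape (T, d)) actually produced during the forward pass (not detached copies), score_head is the scalar FFN defining the inner score, and span_scorer is the head from Equation (1); all names are illustrative.

```python
import torch

def grad_decode(layer_states, score_head, span_scorer):
    """Grad decoding (Equation (2)): score spans from gradients of the inner score."""
    # inner score: mean-pool the final layer's token vectors, then a scalar feed-forward head
    inner_score = score_head(layer_states[-1].mean(dim=0)).squeeze()

    # gradients of the inner score w.r.t. every layer's (T, d) token representations;
    # create_graph=True lets the parse loss backpropagate through this step during training
    grads = torch.autograd.grad(inner_score, layer_states, create_graph=True)
    g = torch.stack(grads).mean(dim=0)     # g_i: average over the L layers, shape (T, d)

    return span_scorer(g)                  # substitute g_i for h_i in Equation (1)
```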
Discussion It is worth noting that while β* is itself a subgradient of the max score partition function, our proposal above merely decodes β̂ from the gradient of the inner score function. That is, the gradients g_i are concatenated and fed into an additional classification head, which is used to produce something chart-sized. The reason for this discrepancy is computational: if we were to feed something chart-sized into our score function, and if score required running a transformer over its input, our approach would be quartic in T. Instead, our approach retains a quadratic dependence on T. Because it involves calculating gradients with respect to the L × T × d-sized transformer representations, however, it is in practice more expensive than using Equation (1); see Appendix C for details.

Constituency Parsing Experiments
We conduct experiments in two main settings. We first consider modern neural constituency parsing on standard benchmark datasets. We then consider parsing under randomly generated PCFGs. We highlight several important differences between these two settings. First, modern constituency parsing is grammarless. As such, modern constituency parsers do not predict the highest scoring parse under some PCFG, and they use a simpler variant of CKY which composes spans but has no notion of grammar rules. While it is still interesting (at least from a computational efficiency perspective) to see if directly predicting parses is competitive with running this simpler CKY variant, it is difficult to distinguish a transformer learning this CKY variant from it simply learning to predict gold parses. This concern motivates our second setting, of randomly generated PCFGs, where there are well-defined highest-scoring parses under each PCFG, and where we can evaluate whether the transformer has predicted them. We consider this random PCFG setting in Section 5.
Datasets We conduct constituency parsing experiments on the English Penn Treebank (PTB; Marcus et al., 1993), the Chinese Penn Treebank (CTB; Xue et al., 2005), as well as the treebanks from the SPMRL 2013 and 2014 shared tasks (Seddah et al., 2013). We use the standard dataset splits throughout.
Model Details We adapt the chart parser first proposed by Kitaev and Klein (2018), and later refined by Kitaev et al. (2019) to involve fine-tuning a pretrained model. In particular, while Kitaev et al. (2019) fine-tune a pretrained BERT model (Devlin et al., 2019) using a margin-based structured loss between the CKY parse and the gold parse, we instead fine-tune BERT to simply predict the parse as in Equation (1), and we train with the independent-span cross-entropy loss described in Section 3. Some of the experiments in Kitaev et al. (2019) also make use of an additional factored self-attention layer that consumes the BERT representations, but we do not use this layer when predicting according to Equations (1) or (2).
Our implementation is a modification of the public Kitaev et al. (2019) implementation, which also forms our main baseline. While this parser is not always state-of-the-art, it is quite close, and state-of-the-art parsers generally make use of its architecture and approach (Mrini et al., 2020; Tian et al., 2020).
The pretrained models used to initialize both the Kitaev et al. (2019) model and our own for the English and Chinese treebanks are BERT-large-uncased (Devlin et al., 2019) and BERT-base-chinese, respectively; BERT-base-multilingual-cased (Devlin et al., 2019) is used to initialize models for Korean, German, and the rest of the SPMRL treebanks (see Appendix D). As in the implementation of Kitaev et al. (2019), we train with AdamW (Loshchilov and Hutter, 2018; Kingma and Ba, 2015) until validation parsing performance stops increasing. We use the same learning rate scheduler as suggested in Kitaev et al. (2019), which starts with 160 steps of warm-up, then decreases the learning rate by multiplying it by 0.5 when the F1 score stops improving. We used a grid search to select hyperparameters, and we provide the grid search details and the optimal hyperparameters found in Appendix F.

Results
In Table 1 we report parsing results on these datasets, using the standard evalb evaluation. We find that our approach is competitive with the Kitaev et al. (2019) approach, slightly outperforming it on the English and Korean datasets, and slightly underperforming it on the Chinese and German datasets (see Appendix D for the results on the rest of the SPMRL treebanks). In this setting, grad decoding does not improve over simply using Equation (1). However, as we discuss in Section 5, sharing transformer layers is important in seeing the benefits of grad decoding, which we cannot do effectively with pretrained BERT models.
The fact that our simplest approach is competitive on these constituency parsing benchmarks is interesting, given that it is much faster. In Table 2 we compare the speed of our approach (in sentences parsed per second) to that of the baseline parser, and we see it is roughly two times faster. These numbers reflect the time necessary to parse the PTB and CTB development sets using the Kitaev et al. (2019) parser, and using our modification of it. Both experiments were run on the same machine, using an NVIDIA RTX A6000 GPU. We also emphasize that this comparison is somewhat favorable to the baseline parser, since we use the original code's batching, whereas our approach makes it extremely easy to create big, padded batches, and thus speed up parsing further. Additionally, we compared against Supar's (Zhang et al., 2020b,a) implementation, which we do not include since it was slower than that of Kitaev et al. (2019).

Table 3 :
Performance of our approaches and baseline when trained from scratch and evaluated in terms of F1 on the PTB and CTB development sets (respectively). "SL" indicates that transformer layers are shared.
While the above experiments all fine-tune pretrained BERT-style models for parsing, it is also worth examining the performance of these models when trained from scratch. We accordingly train transformers from scratch using models of the same size as those in Table 1 on the PTB and CTB training sets, and report results on the development sets in Table 3. We find again that non-CKY-based models are competitive. Furthermore, since we can now easily share transformer layers without sacrificing the advantage of pretrained layers, grad decoding has a more positive effect. We did not see a corresponding benefit to sharing transformer layers when predicting without grad decoding.

Random PCFG Experiments
To generate random PCFGs, we follow the method used by Clark and Fijalkow (2021). Their method involves first generating a synthetic context-free grammar (CFG) with a specified number of terminals, non-terminals, binary rules, and lexical rules. To assign probabilities to the rules, they then use an EM-based estimation procedure (Lari and Young, 1990; Carroll and Charniak, 1992) to update the production rules such that the length distribution of the estimated PCFG is similar to that of the PTB corpus (Marcus et al., 1993).

Data
The approach of Clark and Fijalkow (2021) allows us to construct random grammars with a desired number of nonterminals and rules, and we generate grammars having 20 nonterminals and 100, 400, and 800 rules, respectively. The number of terminals and lexical rules is set to 5000. Having more rules per nonterminal generally increases ambiguity, and so we would expect a synthetic grammar with 800 rules to be significantly more difficult to parse than one with 100, and a synthetic grammar with only 20 nonterminals to be more difficult than one with more nonterminals.
It is common to quantify ambiguity in terms of the conditional entropy of parse trees given sentences (Clark and Fijalkow, 2021), which can be estimated by sampling trees from the PCFG and averaging the negative log conditional probabilities of the trees given their sentences. The conditional entropy of a PCFG G is thus estimated as

    Ĥ_G(τ | x) = -(1/N) Σ_{n=1}^{N} [ log p_G(x^{(n)}, τ^{(n)}) - log p_G(x^{(n)}) ],

where the (x^{(n)}, τ^{(n)}) are sampled from G, and where p_G(x^{(n)}) is calculated with the inside algorithm. We use N = 1000 samples. Below we report the Ĥ_G(τ | x) of each random grammar along with parsing performance; we will see that Ĥ_G(τ | x) negatively correlates with performance.
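A small sketch of this Monte Carlo estimate follows; sample_from_pcfg (returning a sampled sentence together with the joint log probability of its tree) and inside_log_prob (the inside algorithm) are assumed helper functions, not functions from any particular library.

```python
def conditional_entropy(grammar, sample_from_pcfg, inside_log_prob, n_samples=1000):
    """Monte Carlo estimate of H_G(tau | x) by sampling (sentence, tree) pairs from the PCFG."""
    total = 0.0
    for _ in range(n_samples):
        sentence, joint_lp = sample_from_pcfg(grammar)    # joint_lp = log p_G(x, tau)
        marginal_lp = inside_log_prob(grammar, sentence)  # log p_G(x), via the inside algorithm
        total += joint_lp - marginal_lp                   # log p_G(tau | x)
    return -total / n_samples
```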
Our datasets consist of sentences sampled from these PCFGs, paired with the corresponding parses computed by the CKY algorithm.
Dataset size Because we are interested in testing the ability of transformers to capture the CKY algorithm, we must ensure that the training set is sufficiently large that prediction errors can be attributed to the transformer failing to learn the algorithm, and not to sparsity in the training data. We ensure this by simply adding data until validation performance plateaus. In particular, we use ∼200K sentences for training and 2K held-out sentences. To control the complexity of our datasets, we also limit the maximum sentence length to 30, which decreases the average sentence length from 20-25 words per sentence to 18.

Models and baselines
Whereas the experiments in the previous section mostly make use of standard BERT-like architectures, either fine-tuned or trained from scratch, in this section we additionally consider making parse predictions (as in Equations (1) and (2)) with transformer variants which are intended to improve performance on "algorithmic" tasks. Indeed, recent work has proposed architectures of this kind, including the Universal Transformer (UT; Dehghani et al., 2018) and the Neural Data Router (NDR; Csordás et al., 2021).

Table 4 :
Labeled span F1 performance (as in evalb) on randomly generated PCFGs with |N| = 20 nonterminals, and |R| = 100, 400, and 800 rules, respectively. Ĥ_G(τ|x) indicates the estimated conditional entropy of the grammar. ∆_lp is the average log probability difference between CKY and predicted parses. "SL" indicates that transformer layers are shared, "CG" that a copy-gate is used, and "GD" that grad decoding is used.
One major architectural modification common to both UT and NDR is that all transformer layers share the same parameters; this modification is intended to capture the intuition that recursive computation often requires applying the same function multiple times. NDR additionally makes use of a "copy gate," which allows transformer representations at layer l to be simply copied over as the representation at layer l + 1 without being further processed, as well as "geometric attention," which biases the self-attention to attend to nearby tokens. We found geometric attention to significantly hurt performance in preliminary experiments, and so we report results only with the shared-layer and copy-gate modifications.
All models in this section are trained from scratch. This is convenient for UT- and NDR-style architectures, for which we do not have large pretrained models, but we also found in preliminary experiments that on random PCFGs pretrained vanilla transformers (such as BERT) did not improve over transformers trained from scratch. We did find a modest benefit to fine-tuning a BERT-style model that we pretrained on sentences from our randomly generated grammars (rather than on natural language), which accords with the recent work of Zhao et al. (2023); see Appendix E for details. Our implementation is based on the BERT implementation in the Hugging Face transformers library (Wolf et al., 2020).

Evaluation We evaluate our predicted parses against the CKY parse returned by torch-struct (Rush, 2020), using F1 over span labels as in evalb. Since there may be multiple highest-scoring parses, we also report the average difference in log probability, averaged over each production, between the CKY parse returned by torch-struct and the predicted parse. We refer to this metric as "∆_lp" in tables. Because transformer-based parsers may predict invalid productions, we smooth the PCFG by adding 10^-5 to each production probability and renormalizing before calculating ∆_lp.
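The smoothing step used before computing ∆_lp can be sketched as follows, assuming (purely for illustration) that the grammar's production probabilities are stored per left-hand-side nonterminal.

```python
import numpy as np

def smooth_pcfg(rule_probs, eps=1e-5):
    """Add eps to every production probability and renormalize over each left-hand side."""
    smoothed = {}
    for lhs, probs in rule_probs.items():
        p = np.asarray(probs, dtype=np.float64) + eps
        smoothed[lhs] = p / p.sum()      # each LHS's productions again sum to one
    return smoothed
```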

Results
The results of these synthetic parsing experiments are in Table 4, where we show parsing performance over three random grammars with different conditional entropies. We see that transformers struggle as the ambiguity increases, suggesting that these models are not in fact implementing CKY-like processing. At the same time, we see that incorporating inductive bias does help in nearly all cases, and that gradient decoding together with sharing transformer layers performs best for all grammars.
We conduct a speed evaluation in Table 5, where we compare with the CKY implementation in torch-struct (Rush, 2020), which is optimized for performance on GPUs. For a fixed number of nonterminals, neither the speed of torch-struct's CKY implementation nor that of our transformer-based approximation is significantly affected by the number of rules. Accordingly, we show speed results for parsing just the synthetic grammar with 400 rules, for three different batch sizes. All timing experiments are run on the same machine and utilize an NVIDIA RTX A6000 GPU. We see that the transformer-based approximation is always faster, although its advantage decreases as batch size increases. We note, however, that because our implementation uses Hugging Face transformers components (Wolf et al., 2020), which at present do not make use of optimizations such as FlashAttention (Dao et al., 2022), our approach could likely be sped up further.

Related Work
Our approach most closely resembles chart-based methods, in that we compute scores for all spans for all non-terminals. However, unlike chart-based parsers, we aim only to approximate, or amortize, running the CKY algorithm. Unlike transition-based or sequence-to-sequence-style methods, our approximation involves attempting to predict a parse jointly and independently, rather than incrementally. Within the world of chart-based neural parsers, those of Kitaev and Klein (2018) and Kitaev et al. (2019) have been enormously influential, and state-of-the-art constituency parsers, such as those of Mrini et al. (2020) and Tian et al. (2020), adapt this approach while improving it.
The approach of employing an independent-span classification training objective along with greedy decoding for inference has also been explored by Zhang et al. (2019) in the context of dependency parsing. It is worth emphasizing that while the Zhang et al. (2019) work shows that a scoring model with quadratic complexity can approximate the also-quadratic Chu-Liu-Edmonds algorithm (Chu and Liu, 1965), used for decoding in dependency parsing, our work focuses on approximating a cubic-complexity decoding algorithm using a scoring model with quadratic complexity.
We are additionally motivated by recent work on neural algorithmic reasoning (Xu et al., 2019; Csordás et al., 2021; Dudzik and Veličković, 2022; Delétang et al., 2022; Ibarz et al., 2022; Liu et al., 2023), some of which has endeavored to solve classical dynamic programs (Veličković et al., 2022) and MDPs (Chen et al., 2021) with graph neural networks and transformers, respectively. One respect in which learning to compute CKY differs from many other algorithmic reasoning challenges (including other dynamic programs) is that in addition to its discrete sentence input, CKY also consumes continuous log potentials.
Finally, the idea of training a model to produce the solution to an optimization problem is known as training an "inference network," and has been used famously in approximate probabilistic inference scenarios (Kingma and Welling, 2014) and in approximating gradient descent (Johnson et al., 2016). Most similar to our approach, Tu and Gimpel (2018) train an inference network to do Viterbi-style sequence labeling, although they do not consider parsing, and they require inference networks for both training and test-time prediction due to their large-margin approach.

Discussion and Conclusion
Our findings on the ability of transformers to approximate CKY are decidedly mixed. On the one hand, using transformers to independently predict spans in a constituency parse is competitive with using very strong neural chart parsers, and it is moreover much faster to predict parses in this independent, non-CKY-based way. On the other hand, if transformers are capturing the CKY computation, they ought to be able to parse even under random PCFGs, and it is clear that as the ambiguity of the grammar increases they struggle with this.
We have also found that making a transformer's computation more closely resemble that of a classical algorithm, either by sharing computation layers as proposed by Dehghani et al. (2018) and Csordás et al. (2021), or by having it make use of the gradient with respect to a scoring function, is helpful. This finding both confirms previous results in this area, and also suggests that the inductive bias we seek to incorporate in our models may need to closely match the problem.
There are many avenues for future work, and attempting to find a minimal general-purpose architecture that can in fact parse under random PCFGs is an important challenge. In particular, it is worth exploring whether other forms of pretraining (i.e., pretraining distinct from BERT's) might benefit this task more. Another important future challenge to address is whether it is possible to have a model consume both the input sentence as well as the grammar parameters (as the CKY algorithm does), rather than merely pretrain on parsed sentences generated using those parameters.

Limitations
A limitation of the general paradigm of learning to compute algorithmically is that it requires a training phase, which can be computationally expensive, and which requires annotated data. This is less of a limitation in the case of constituency parsing, however, since we are likely to be training models in any case.
Another important limitation of our work is that we have only provided evidence that transformers are unable to implement CKY in our particular experimental setting. While we have endeavored to find the best-performing combinations of models and losses (and while this combination appears to perform well for constituency parsing), it is possible that other transformer-based architectures or other losses could significantly improve in terms of parsing random grammars.
We also note that a limitation of the grad decoding approach we propose is that we have found that it is more sensitive to optimization hyperparameters than are the baseline approaches.
Finally, we note that our best-performing constituency parsing results make use of large pretrained models. These models are expensive to train, and do not necessarily exist for all languages we would like to parse.

A Additional background on CKY
Letting ℓ_R ∈ R^{|R|×|N|×|N|} represent the log potentials (e.g., log probabilities) associated with the rules in PCFG G, and ℓ_E(x) ∈ R^{T×|N|} the log potentials corresponding to each token in input sentence x, we compute the chart β ∈ R^{T×T×|N|} for x, where β[i, j, a] represents the sum (under a particular semiring) of all weight associated with the a-th non-terminal yielding x_{i:j}. In particular, with semiring operations ⊕ and ⊗, we have β[i, i, a] = ℓ_E(x)[i, a], and, if j > i,

    β[i, j, a] = ⊕_{k=i}^{j-1} ⊕_{b,c} ℓ_R[a, b, c] ⊗ β[i, k, b] ⊗ β[k+1, j, c].

Assuming the first slice of ℓ_R along the first dimension (i.e., ℓ_R[1, :, :]) corresponds to rules with S on the left-hand-side, the log partition function is then given by

    A(ℓ_R, ℓ_E) = ⊕_{k=1}^{T-1} ⊕_{b,c} ℓ_R[1, b, c] ⊗ β[1, k, b] ⊗ β[k+1, T, c].

Furthermore, under the max-plus semiring, one of the subgradients of A(ℓ_R, ℓ_E) with respect to β is a one-hot representation of a highest scoring parse for x under G (Eisner, 2016; Rush, 2020). We refer to such a one-hot subgradient as β* ∈ {0, 1}^{T×T×|N|}. Rush (2020) therefore proposes to compute β* using automatic differentiation, and provides a fast implementation tuned for use on GPUs.
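As a small illustration of this autodiff route (not the torch-struct implementation itself), one can build the max-plus chart with differentiable tensor operations and then take gradients of A with respect to the chart cells; the dense rule_lp/term_lp tensors and the start-symbol-at-index-0 convention are assumptions carried over from the sketch in Section 2.

```python
import torch

def cky_by_backprop(rule_lp, term_lp):
    """Build the max-plus chart with differentiable ops, then read beta* off as a subgradient of A."""
    T, N = term_lp.shape
    term_lp = term_lp.detach().requires_grad_()            # start the autograd graph at the leaves
    cells = {}                                              # cells[(i, j)] is the vector beta[i, j, :]
    for i in range(T):
        cells[(i, i)] = term_lp[i]
    for width in range(1, T):
        for i in range(T - width):
            j = i + width
            cands = [
                (rule_lp + cells[(i, k)][None, :, None] + cells[(k + 1, j)][None, None, :]).reshape(N, -1)
                for k in range(i, j)
            ]
            cells[(i, j)] = torch.cat(cands, dim=1).max(dim=1).values   # max-plus over splits and children
    A = cells[(0, T - 1)][0]                                # start symbol S assumed to be index 0
    grads = torch.autograd.grad(A, list(cells.values()))    # subgradient of A w.r.t. the chart cells
    beta_star = torch.zeros(T, T, N)
    for (i, j), g in zip(cells, grads):
        beta_star[i, j] = g                                 # 1 exactly on the spans/labels in the parse
    return A, beta_star
```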

B Model, Training, and Dataset Details
Dataset Details We provide details on the standard constituency parsing datasets used in our experiments in Table 6: the English Penn Treebank (Marcus et al., 1993), the Chinese Penn Treebank (Xue et al., 2005), and the SPMRL treebanks (Seddah et al., 2013).
Terms of Use We used the standard English Penn Treebank, Chinese Penn Treebank, and treebanks (except for Arabic) from the SPMRL 2013 and 2014 shared tasks in accordance with their licenses. Both PTB and CTB are under Linguistic Data Consortium (LDC) licenses. The German, Hebrew, Korean, and Swedish treebanks are not under any specific licenses. The Basque treebank is licensed under the Creative Commons license. The Polish treebank is licensed under GPL v3. We also use Hugging Face code and models in accordance with their licenses (Apache 2.0). For our deep learning framework, we use PyTorch (Paszke et al., 2019), which is under the BSD-3 license. We also make use of code provided by Kitaev et al. (2019); their code is available under the MIT license.
Computational Budget All of our experiments were run on NVIDIA RTX A6000 GPUs. We provide the computational time of our experiments on the constituency parsing datasets and a random PCFG.

Memory Comparison
The maximum memory allocation of our model with gradient decoding is 1.52 GB, which is only 10% higher compared to our model without gradient decoding, which has a maximum memory allocation of 1.38 GB.

D Constituency Parsing Results on SPMRL
Table 12 shows that our approach outperforms Kitaev et al. (2019) on half of the SPMRL treebanks. We excluded the Arabic treebank since we were unable to obtain its corresponding license.
Considering that gradient decoding did not lead to substantial improvements in the results for PTB, CTB, German, and Korean, as indicated in Table 1, we decided not to perform the gradient decoding experiments on the rest of the SPMRL treebanks due to limitations in computational resources.

E MLM Pretraining on Random PCFG-Generated Data
Several works have suggested that pretrained models can capture syntactic information (Hewitt and Manning, 2019; Manning et al., 2020; Maudslay and Cotterell, 2021; Zhao et al., 2023). In particular, Zhao et al. (2023) argued that a connection exists between MLM and the inside-outside algorithm. Through probing, they show that models pretrained on synthetic PCFG data may be approximating the inside-outside algorithm. Since inside-outside and CKY are related, it is natural to ask whether MLM pretraining can also be helpful when it comes to approximating CKY. We therefore pretrain a transformer model on 500K sentences generated from the random PCFG with 800 rules. We selected our training setup following the Cramming training recipe (Geiping and Goldstein, 2022). For training, we utilized a large batch size of 4096, accumulating gradients and performing an update every 32 steps. We employed a linear warmup schedule for 10% of the total training steps, with a peak learning rate of 1 × 10^-4. The model achieved a training perplexity of 92.01 and a validation perplexity of 92.20.
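For reference, the pretraining setup described above can be summarized as the following (hypothetical) configuration; only the listed values come from the text, and the surrounding dataset and trainer wiring is assumed and not shown.

```python
mlm_pretrain_config = {
    "corpus": "500K sentences sampled from the random PCFG with 800 rules",
    "objective": "masked language modeling",
    "effective_batch_size": 4096,          # reached by accumulating gradients over 32 steps
    "gradient_accumulation_steps": 32,
    "lr_schedule": "linear warmup over 10% of total training steps",
    "peak_learning_rate": 1e-4,
}
```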
In Table 13, we show that fine-tuning the pretrained model leads to performance improvements, particularly when layers are not shared. It is important to highlight that while pretraining is helpful, it comes with higher computational costs compared to relying solely on inductive biases, and it yields less improvement than gradient decoding.

F Hyperparameters
Table 14 shows the grid we used to search for the optimal hyperparameters of the models used in the constituency parsing and random PCFG experiments. We also provide the optimal hyperparameters in Tables 15 and 16. For the baseline result (Kitaev and Klein, 2018) in Table 1, we used the hyperparameters recommended in their code. The results in Tables 1 and 4 make use of the optimal hyperparameters and are based on a single run using a fixed seed throughout all experiments.

Table 15 :
Optimal hyperparameters used for our constituency parsing experiments. We denote the learning rate as "LR", weight decay as "WD", and attention dropout as "AD". Similar hyperparameters are used in the baseline model (Kitaev et al., 2019) across the datasets. Results are provided in Table 1.
Figure 1: A schematic view of training and inference under the classical approach (top) and under the learned transformer approximation (bottom). Classical parsing has no training phase, and uses pre-defined log potentials to parse unseen sentences. Our proposed learned parser (an "inference network") trains on pairs of sentences and parses computed under a set of log potentials, and then parses unseen sentences without access to the potentials.

Table 1 :
F1 score of our approach and of Kitaev et al. (2019) on PTB, CTB, and the German and Korean treebanks from the SPMRL 2014 shared task. All models use the same pretrained initialization; see text for details. The Kitaev et al. (2019)a results are those reported in the paper, which make use of an additional factored self-attention layer, while Kitaev et al. (2019)b are the results of running their code without this additional layer.

Table 2 :
Inference speed (in sentences/second) of the CKY-based Kitaev et al. (2019) parser and our own, averaged over the PTB and CTB development sets.

Table 5 :
Inference speed of GPU-based torch-struct CKY parsing and our transformer-based approach on a random grammar with |R| = 400 rules. We report the median speed of running our model and CKY 5 times on 60K samples in batches of 32, 64, and 128, respectively.

Table 6 :
Number of examples in the standard splits of the English Penn Treebank.

Since the training times of random PCFGs with |R| = 100, 400, and 800 rules are similar, we only provide that of the PCFG with |R| = 400.

Table 8 :
Model sizes for random PCFG experiments.
Speed Comparison In Table 11 we compare inference time between our models (with and without gradient decoding) and CKY.

Table 11 :
Inference speed of GPU-based torch-struct CKY parsing and our transformer-based approach on a random grammar with |R| = 400 rules. We report the median speed of running our models and CKY 5 times on 60K samples in batches of 32, 64, and 128, respectively.

Table 12 :
F1 score of our approach and of Kitaev et al. (2019) on the test sets of the SPMRL treebanks. The Kitaev et al. (2019) results are those reported in the paper.

Table 14 :
The grid used to search for the optimal hyperparameters in our constituency parsing and random PCFG experiments. (a) The default scheduler used in Kitaev et al. (2019), as mentioned in Section 4. (b) 160 steps of warm-up, then decreasing the learning rate linearly.

Table 16 :
Optimal hyperparameters used for our random PCFG experiments. We denote the learning rate as "LR", weight decay as "WD", attention dropout as "AD", and the number of warmup steps as "WS". Results are provided in Table 4.