Pushdown Layers: Encoding Recursive Structure in Transformer Language Models

Recursion is a prominent feature of human language, and fundamentally challenging for self-attention due to the lack of an explicit recursive-state tracking mechanism. Consequently, Transformer language models poorly capture long-tail recursive structure and exhibit sample-inefficient syntactic generalization. This work introduces Pushdown Layers, a new self-attention layer that models recursive state via a stack tape tracking estimated depths of every token in an incremental parse of the observed prefix. Transformer LMs with Pushdown Layers are syntactic language models that autoregressively and synchronously update this stack tape as they predict new tokens, in turn using the stack tape to softly modulate attention over tokens -- for instance, learning to "skip" over closed constituents. When trained on a corpus of strings annotated with silver constituency parses, Transformers equipped with Pushdown Layers achieve dramatically better and 3-5x more sample-efficient syntactic generalization, while maintaining similar perplexities. Pushdown Layers are a drop-in replacement for standard self-attention. We illustrate this by finetuning GPT2-medium with Pushdown Layers on an automatically parsed WikiText-103, leading to improvements on several GLUE text classification tasks.


Introduction
An important property of human language and thought is recursion, which allows us to compose and reason about complex objects in terms of simpler constituents (Hauser et al., 2002). While extensively studied in natural language syntax and semantics, recursion is also a key component of several other aspects of intelligent behavior, including mathematical reasoning, programming, and goal-directed planning. Most recursion-capable systems model recursive processes via a stack memory, which is updated as new computation is performed.
For instance, a programming language may implement recursion by maintaining a run-time stack of caller-callee frames, storing intermediate outputs on the stack, and updating the stack as new function calls are made. Similarly, a shift-reduce parser implements recursion through a stack of intermediate constituents, shifting tokens onto the stack as they are observed, and occasionally reducing stack elements into constituents as they are completed.
In contrast, the self-attention mechanism underlying modern neural sequence models has no explicit mechanism for maintaining a stack memory as it generates strings, and instead relies on hidden representations to implicitly but imperfectly encode such information (Manning et al., 2020). While this encoding can model bounded recursive structure in formal languages (Yao et al., 2021), it is unclear whether it is sufficient for robust syntactic generalization, especially in data-constrained settings.
In this work, we show that an explicit stack memory mechanism can improve syntactic generalization in Transformer language models (LMs). We introduce Pushdown Layers, a drop-in replacement for standard self-attention that augments Transformer LMs with stack memory. This memory is modeled using a stack tape that stores estimated depths of every token in an incremental parse of the observed prefix. The stack tape is updated autoregressively: as new tokens are predicted, Transformers with Pushdown Layers (Pushdown Transformers) synchronously make probabilistic attachment decisions to either "shift", assigning the newly predicted token a depth of 0, or "reduce" with one of the constituents in the prefix so far, updating token depths accordingly (see Fig. 1). The stack tape is used to additively and softly modulate the Transformer's attention over tokens -- for instance, Pushdown Layers may guide the LM to attend only to head words of constituents, or to skip over reduced constituents by decreasing attention.
Pushdown Transformer LMs are syntactic language models that learn joint probabilities of sequences and parses in terms of individual word predictions and structure-building operations, and can be trained on any text corpus annotated with constituency parses. But unlike other syntactic language models with structural supervision (Vinyals et al., 2015; Choe and Charniak, 2016; Qian et al., 2021; Sartran et al., 2022), Pushdown Layers do not change the output space of the underlying sequence model and impose no constraints on attention mechanisms -- the manner in which Pushdown Layers use syntactic structure for representation building is learnt purely via gradient descent.
Pushdown Transformers obtain strong generalization improvements on both synthetic and natural-language benchmarks, as we show in the remainder of the paper.

Background
Multi-Head Self-Attention. Transformer language models (Vaswani et al., 2017) are a class of neural sequence models that use multi-head self-attention to obtain contextualized representations of tokens in a sequence, which are then used to predict the next token. In particular, let x = {x_1, x_2, ..., x_n} be an input sequence, and let h^l_i ∈ R^d be the hidden representation of the i-th token at the l-th attention block. The hidden representation of the i-th token is updated as

h^{l+1}_i = FF(O · [A_1(h^l)_i ; ... ; A_H(h^l)_i]),  (1)

where O ∈ R^{d×d} is a learnt matrix, FF denotes a feed-forward + residual + layer-norm block, and A_p is the p-th of H self-attention heads. Each attention head performs a weighted average over its inputs,

A_p(h^l)_i = Σ_{j≤i} α_{ij} V_p h^l_j,  (2)

where α_{ij} is the attention weight assigned to the j-th token by the i-th token. These attention weights are computed as

α_{ij} = softmax_{j≤i}((Q_p h^l_i)^T (K_p h^l_j) / √d),  (3)

where Q_p, K_p, V_p are the query, key, and value projections of head p.

Limitations of Self-Attention. When trained on text corpora, Transformers implicitly encode several aspects of linguistic structure without supervision (Clark et al., 2019; Hewitt and Manning, 2019; Murty et al., 2023). However, there is mounting evidence that recursion, a key feature of human language, remains a challenge. Hahn (2020) shows theoretically that hard attention cannot model simple recursive structures like 2-DYCK (see Section 6 for an extended discussion). Empirically, Lakretz et al. (2022) show that self-attention struggles with center-embedding phenomena, and Zhang et al. (2023) show poor performance on simple recursive tree-traversal problems. We hypothesize that a key reason for poor modeling of recursive structure in self-attention is the lack of an explicit structural inductive bias. One common way to add such an inductive bias is via joint modeling of strings and syntactic structure, which we introduce next.
Syntactic Language Models. Let y be the ground-truth syntactic parse of x. A long line of work (Vinyals et al., 2015; Dyer et al., 2016; Choe and Charniak, 2016; Qian et al., 2021; Sartran et al., 2022) learns joint distributions p(x, y) to incorporate explicit syntactic structure into neural language models, by learning to output a sequence of transition actions, where actions a_i correspond both to word-level predictions and to structural actions that open and close constituents, building up the parse tree in a top-down, left-to-right manner. These models have several limitations that motivate our proposed approach. First, their outputs are sequences of transition actions that include both text and tree-building operations; since each constituent in a parse tree has an opening and a closing transition action, and there are ≈ n constituents for x, this increases input length by a factor of about 3, leading to significant computation and memory overhead. Second, inference in neural models operating on transitions requires bespoke decoding procedures that carefully balance tradeoffs between high-entropy word-level predictions and low-entropy structural predictions (Stern et al., 2017). Finally, to explicitly bias Transformer computations to mirror the recursive structure of parse trees, some approaches such as PLM-mask (Qian et al., 2021) and TGs (Sartran et al., 2022) impose hard constraints on attention patterns. Pushdown Layers provide a softer syntactic bias that is amenable to gradient-based learning, while having broader applicability to phenomena beyond local tree-structuredness, such as topical dependencies and coreference.
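To make the transition-sequence overhead concrete, the following sketch linearizes a parse into top-down actions in the style of Choe and Charniak (2016); the nested-tuple tree encoding and the action format are our own illustrative choices, not the notation of any of the cited systems.

```python
def to_actions(tree):
    """Linearize a parse tree into top-down, left-to-right transition
    actions: one action per word, plus an open and a close action for
    every constituent."""
    if isinstance(tree, str):            # terminal: a word-level action
        return [tree]
    label, children = tree
    actions = [f"({label}"]              # open the constituent
    for child in children:
        actions += to_actions(child)
    return actions + [f"){label}"]       # close the constituent

# "The dog is happy" with 4 constituents (S, NP, VP, ADJP)
parse = ("S", [("NP", ["The", "dog"]),
               ("VP", ["is", ("ADJP", ["happy"])])])
actions = to_actions(parse)
```

Here 4 words become 12 actions (4 word actions plus an open and a close for each of the 4 constituents), illustrating the roughly threefold increase in sequence length that transition-based syntactic LMs must model.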

Pushdown Layers
Transformer LMs with Pushdown Layers are syntactic language models that generate strings while simultaneously building a parse tree over those strings from left to right. The parse tree is built incrementally by tracking the recursive state of every token, which is updated synchronously with word-level predictions. This recursive state is represented via our stack tape as tree depths of every prefix token, and updates are realized with a stack. The contents of the stack tape are used to softly modulate attention over prefix tokens via additive offsets to attention logits (Fig. 2).

Stack Tape
Like ordinary self-attention, Pushdown Layers take a sequence of hidden states {h^l_k} as input and output a sequence {h^{l+1}_k}. Additionally, Pushdown Layers use a stack tape W_k ∈ {0, 1, ..., k}^k to simulate a pushdown automaton that performs shift/reduce operations over tokens as they are predicted (Fig. 2). The contents of the stack tape encode recursive state by tracking the depth of each token within reduced constituents in the stack. Concretely, after observing the prefix x_{≤k} = {x_1, x_2, ..., x_k}, W_k[j] = 0 if token x_j has not been reduced with any other token, while W_k[j] = p means that x_j has appeared in p reduce operations such that the resulting constituent has token x_j at depth p. In Fig. 2, the stack tape encodes [1, 1, 0] for the incremental parse [The dog] is.
Updating the Stack Tape. As shown in Fig. 2, along with predicting the next word happy, Transformers with Pushdown Layers (Pushdown Transformers) make an attachment decision to update their stack tape. In our running example, this is done by selecting a constituent from the incremental parse [The dog] is happy.
Concretely, given prefix x_{<k}, Pushdown Transformers predict the next token x_k as well as an update to the stack tape W_{k−1}. This is done by selecting a token r_k to reduce with, out of candidate tokens {x_1, x_2, ..., x_k}, via attention over hidden states. We first form

h̃^L_k = MLP(x_k, h^L_{k−1}),  (4)

where L is the final layer of the Transformer and h̃^L_k is a vector representation for the newly predicted token x_k. This vector attends to all tokens to make a probabilistic attachment decision,

p(j | x_{<k}; W_{k−1}) = softmax_j((h̃^L_k)^T W h^L_j),  (5)

where W ∈ R^{d×d} is a learnt parameter matrix. We use these probabilities to select the token r_k = arg max_j p(j | x_{<k}; W_{k−1}) to reduce x_k with, and the stack tape is updated accordingly via Algorithm 1. Note that attachment decisions to constituents are made by computing the attachment score for the rightmost token in the constituent. In our running example, the model selects the constituent [The dog] by selecting the word dog, forming the parse [[The dog] [is happy]] and updating the stack tape from [1, 1, 0, 0] to [2, 2, 2, 2].
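Since Algorithm 1 is not reproduced in this excerpt, the sketch below gives one plausible realization of the tape update, reverse-engineered from the paper's worked examples (the class, its method names, and the explicit constituent stack are our own): a shift pushes a singleton constituent at depth 0, while a reduce pops every constituent to the right of the chosen one, nests the popped material with the new token in a right-branching fashion, and merges the result with the chosen constituent, incrementing depths accordingly.

```python
class StackTape:
    """Token depths in an incremental parse, updated by shift/reduce.

    One plausible realization of the paper's stack-tape update
    (Algorithm 1 is not reproduced in this excerpt); it matches the
    paper's worked examples. Names are ours.
    """

    def __init__(self):
        self.depths = []   # W_k: current depth of every observed token
        self.stack = []    # open constituents, as lists of token indices

    def push(self, k, attach_to=None):
        """Observe token k. attach_to is the index r_k chosen by the
        attachment head; None (or k itself) means a plain shift."""
        self.depths.append(0)
        if attach_to is None or attach_to == k:
            self.stack.append([k])       # shift: new singleton constituent
            return
        # pop constituents strictly to the right of the one ending at r_k
        popped = []
        while self.stack and self.stack[-1][-1] != attach_to:
            popped.append(self.stack.pop())
        assert self.stack, "r_k must be the rightmost token of a constituent"
        popped.reverse()                 # left-to-right order
        # nest popped material with x_k right-branching: [c_1 [c_2 ... x_k]]
        for i, c in enumerate(popped):
            for j in c:
                self.depths[j] += i + 1
        self.depths[k] = len(popped)
        # merge with the chosen constituent: every merged token gains a level
        merged = [j for c in [self.stack.pop()] + popped for j in c] + [k]
        for j in merged:
            self.depths[j] += 1
        self.stack.append(merged)
```

This reproduces the running example: after [The dog] is the tape reads [1, 1, 0], and attaching happy to dog yields [2, 2, 2, 2].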

Computing Attention Scores
We map the contents of W_k onto a per-layer depth embedding d^l_{kj} for every token j ≤ k. These depth embeddings are added to attention keys, resulting in a locally additive modulation to attention scores:

α_{kj} = softmax_{j≤k}((Q_p h^l_k)^T (K_p h^l_j + d^l_{kj}) / √d).  (6)

While this modulation is additive at the level of logits, the logits feed into a softmax and subsequent non-linearities, so the overall effect on representations can be arbitrarily non-linear. The modified attention weights are used to compute contextualized vectors using Eq 2 and Eq 1.
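As a concrete single-head numpy sketch of this modulation, assuming a per-query tape matrix with tape[k, j] = W_k[j] and a depth-embedding lookup table -- the function name, variable names, and shapes are illustrative, not the paper's implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def pushdown_attention(H, Wq, Wk, Wv, depth_emb, tape):
    """Single-head causal self-attention with depth-modulated keys.

    H: (n, d) hidden states; tape: (n, n) lower-triangular ints with
    tape[k, j] = depth of token j in the incremental parse of the prefix
    up to k; depth_emb: (max_depth, d) lookup table.
    """
    n, d = H.shape
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    D = depth_emb[tape]                  # (n, n, d) per-query depth vectors
    K_D = K[None, :, :] + D              # depth-offset keys, one set per query
    scores = np.einsum('kd,kjd->kj', Q, K_D) / np.sqrt(d)
    causal = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    return softmax(scores) @ V           # (n, d) contextualized vectors
```

Note that because the depth offset d^l_{kj} depends on the query position k, every query position sees its own version of the keys, which is what distinguishes this from a simple key bias.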

Training and Inference
Training. Given a corpus of strings annotated with parses, we first extract ground-truth values of W_k for every prefix x_{≤k}. We also extract ground-truth attachment decisions for x_k, given prefix x_{<k}. With these quantities precomputed, we can train Pushdown Transformers in parallel, like standard Transformers. Attachment probabilities (Eq 5) are supervised with ground-truth attachments, along with the standard LM objective, all using hidden states that are contextualized using the Pushdown Layer attention mechanism with the precomputed stack tape.
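For the supervision extraction, one convenient observation is that in a binarized parse, the rightmost leaf of every constituent is the token that closes it, and it attaches to the rightmost leaf of that constituent's left child; all other tokens shift. A sketch over nested-pair trees (the tree encoding and function name are our own):

```python
def attachments(tree, offset=0):
    """Read ground-truth attachment targets off a binarized parse.

    tree is a nested pair (left, right) with string leaves. Returns
    (n_leaves, att) where att[k] = r_k, the token index that the k-th
    token reduces with; att[k] == k means shift.
    """
    if isinstance(tree, str):
        return 1, {offset: offset}            # a lone leaf shifts
    left, right = tree
    n_l, att = attachments(left, offset)
    n_r, att_r = attachments(right, offset + n_l)
    att.update(att_r)
    # the rightmost leaf of this constituent closes it: it attaches to the
    # rightmost leaf of the left child, overriding any inner decision
    att[offset + n_l + n_r - 1] = offset + n_l - 1
    return n_l + n_r, att
```

For (("The", "dog"), ("is", "happy")) this yields {0: 0, 1: 0, 2: 2, 3: 1}: The and is shift, dog attaches to The, and happy attaches to dog, matching the running example. Replaying these attachments through the tape update gives the ground-truth W_k for every prefix.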
Inference. For any string x and parse y, the joint probability p(x, y) factorizes as a product of word-level and attachment scores:

p(x, y) = Π_k p(x_k | x_{<k}; W_{k−1}) · p(r_k | x_{≤k}; W_{k−1}).  (7)

While computing the full marginal p(x) = Σ_y p(x, y) is computationally infeasible due to the large space of possible parses, we approximate it by marginalizing over a smaller subset of parses found with beam search. Crucially, since our model predicts words and structural actions in parallel rather than sequentially, we do not need complex word-synchronous decoding procedures (Stern et al., 2017) that introduce additional hyperparameters.
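The approximate marginalization can be sketched as a beam search over parse hypotheses; step_scores below is a hypothetical hook standing in for the model's joint word-and-attachment scoring, and the procedure is our simplification of the decoding setup, not the authors' code:

```python
import math

def beam_marginal_logprob(tokens, step_scores, beam_size=300):
    """Approximate log p(x) = log sum_y p(x, y) by beam search over parses.

    step_scores(state, token) is a hypothetical model hook returning a list
    of (new_state, log p(token, attachment | state)) pairs, one per
    attachment choice; states would encode the stack tape.
    """
    beam = [(None, 0.0)]                        # (parse state, log prob)
    for token in tokens:
        expanded = [(new_state, logp + delta)
                    for state, logp in beam
                    for new_state, delta in step_scores(state, token)]
        expanded.sort(key=lambda hyp: hyp[1], reverse=True)
        beam = expanded[:beam_size]             # prune to the best hypotheses
    # log-sum-exp over surviving parses approximates the marginal
    m = max(logp for _, logp in beam)
    return m + math.log(sum(math.exp(logp - m) for _, logp in beam))
```

Because every hypothesis advances by exactly one word per step, no word-synchronous bookkeeping is needed: pruning happens once per token, over hypotheses of equal length.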

Implementation Details
FLOPs and memory overhead. Consider query and key matrices Q ∈ R^{n_d×d}, K ∈ R^{n_s×d}, where n_d and n_s refer to destination (hidden states attending) and source (hidden states being attended to). Let S ∈ R^{n_d×n_s} be the (lower-triangular) matrix of pre-computed stack-tape values for every prefix. For each Pushdown Layer, we use S to index into depth embeddings to obtain D ∈ R^{n_d×n_s×d}, which is added to K to obtain K_D ∈ R^{n_d×n_s×d}. Unlike standard self-attention, which multiplies Q and K directly, Pushdown Layers multiply Q (a 2D tensor) with K_D (a 3D tensor). This is done by casting Q into a 3D tensor in R^{n_d×1×d} and performing a batched matrix multiplication with K_D, leading to the same number of operations as standard self-attention. However, since Pushdown Layers require storing 3D tensors for keys, this increases memory requirements relative to standard self-attention. We provide standalone code for implementing a Pushdown Layer block in Appendix D.
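The shape bookkeeping can be illustrated in numpy (sizes are arbitrary):

```python
import numpy as np

# illustrative sizes, not tied to any model configuration
n_d, n_s, d = 4, 4, 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(n_d, d))             # queries (destination states)
K = rng.normal(size=(n_s, d))             # shared keys (source states)
D = rng.normal(size=(n_d, n_s, d))        # depth embedding per (query, key)

K_D = K[None, :, :] + D                   # (n_d, n_s, d): per-query keys
# cast Q to (n_d, 1, d) and batch-multiply: (n_d, 1, n_s) -> (n_d, n_s)
scores = (Q[:, None, :] @ K_D.transpose(0, 2, 1)).squeeze(1)
```

The batched product performs n_d · n_s · d multiply-accumulates, the same as Q K^T, but K_D must be materialized at O(n_d · n_s · d) memory instead of the O(n_s · d) needed for a shared key matrix.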
Attending to hidden states with old memory. Pushdown Transformers build parse trees incrementally from left to right, and so depth values of prefix tokens change as new tokens are predicted. Thus, a token at position i builds its representation by attending to x_{≤i} with a stack tape that may soon become "stale" due to future reduce operations that merge tokens in x_{≤i} with new tokens.

Warm-up: Dyck Languages
We train 6-layer LMs with Pushdown Layers (Pushdown-LM) as well as standard LMs (Base-LM) on 100k strings sampled from DYCK_{20,10}, the language of well-nested brackets with 20 bracket types and a maximum nesting depth of 10. To ensure that improvements are not merely due to multi-task learning with an attachment head, Base-LM is also trained with an attachment loss in a standard multi-task learning setup. To test generalization, models are given an input prefix from a separate DYCK language and evaluated on choosing the correct closing bracket. Specifically, we test generalization to DYCK strings with deeper nesting of 15-50, and to DYCK strings with longer-range dependencies than seen at training time (measured as the distance to the matching bracket that must be closed). From Table 1, we find that Pushdown-LM obtains over a 25-point accuracy improvement over standard language models at generalizing to deeper structure, as well as large improvements at generalizing to longer-range dependencies.

WIKITREES is parsed automatically using a state-of-the-art neural constituency parser (Kitaev et al., 2019). Typically, LMs trained on web-scale data are given multi-sentence contexts with large window sizes as inputs; to adapt this to Pushdown-LMs we make a small number of modifications (see Appendix B for details).
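For reference, here is a minimal DYCK sampler and closing-bracket oracle of the kind such an evaluation needs; the exact generative process behind the training and test distributions is not specified in this excerpt, so the sampler below is only a plausible stand-in.

```python
import random

def sample_dyck(n_types=20, max_depth=10, p_open=0.5, rng=random):
    """Sample a well-nested string from DYCK_{n_types, max_depth}.

    A plausible stand-in sampler, not the paper's generative process.
    """
    s, stack = [], []
    while True:
        if stack and (len(stack) >= max_depth or rng.random() > p_open):
            s.append(f"){stack.pop()}")          # close the deepest bracket
            if not stack and rng.random() > p_open:
                return s
        else:
            t = rng.randrange(n_types)
            stack.append(t)
            s.append(f"({t}")                    # open a fresh bracket

def correct_closing(prefix):
    """Oracle for the evaluation: the bracket closing the deepest
    unmatched opening in a Dyck prefix (None if balanced)."""
    stack = []
    for tok in prefix:
        if tok.startswith("("):
            stack.append(tok[1:])
        else:
            stack.pop()
    return f"){stack[-1]}" if stack else None
```

Evaluating an LM then amounts to checking whether, given the prefix, it assigns the highest probability to correct_closing(prefix) among all closing brackets.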

Sample-Efficient Generalization. To measure sample efficiency in Pushdown Transformers, we train LMs on [10M, 50M, 100M] tokens from WIKITREES. To ensure stable training in low-data regimes, we train a 12-layer GPT-2 using the exact configuration and tokenization scheme of GPT2-small (Radford et al., 2019), and additionally use dropout to prevent overfitting. For these experiments, we compare Base-LM with an LM whose final 6 self-attention blocks are Pushdown Layers (Pushdown-LM). To measure syntactic generalization, we compute aggregate performance on the SG test suites. From the results in Fig. 4, we find that Pushdown-LMs exhibit drastically more sample-efficient syntactic generalization: for instance, Base-LM requires over 40M tokens to surpass the syntactic generalization of a Pushdown-LM trained on just 10M tokens.
Finetuning for text classification. Can Pushdown Layers offer improvements on language understanding tasks, beyond syntactic generalization? To answer this, we perform staged finetuning of GPT2-medium with Pushdown Layers. Specifically, we finetune GPT2-medium with the final 12 self-attention blocks replaced with Pushdown Layers (Pushdown-GPT2) as a language model on WIKITREES. We use this model to obtain parses on 4 text classification tasks: RTE, SST5, MRPC and STS-B from GLUE (Wang et al., 2019a), and use these parses to pre-compute the stack tape for every token. Then, in a second finetuning step, Pushdown-GPT2 is trained to perform text classification on these datasets by reducing each task to language modeling via prompting (see Appendix A for prompt details). As a comparison, we perform the same staged finetuning for the standard GPT2-medium architecture. We report results averaged across 3 seeds in Table 3. We find that Pushdown Layers offer improvements on 3 out of 4 text classification tasks.

Analysis
For all analyses, we use the 16-layer Pushdown-LM trained on BLLIP-LG from Section 4.2.
Parsing. Since Pushdown-LM is a syntactic language model, we obtain parses via beam search (beam size = 300) to approximately recover the most likely parse y* = arg max_y p(x, y) under our model. However, since this parse is (a) unlabeled and (b) binarized, we perform an unlabeled F1 evaluation (using EVALB; Collins, 1997) against binarized ground-truth parses from the PTB test set. We also remove instances containing words unknown to our model, since it is trained without any UNK tokens, leaving 2335 of 2416 sentences. We compare our model against Kitaev et al. (2019), the parser used to annotate training data for Pushdown-LM, and also report unlabeled F1 on the auto-parsed BLLIP-LG test set. From the results in Table 4, our model achieves a very competitive unlabeled F1 score of 95.3, outperforming the official implementation of Kitaev et al. (2019), and a high F1 score of 97.3 on the BLLIP-LG test set.

Figure 5: Average attention over the distractor noun when the verb is being predicted, for Base-LM and Pushdown-LM (ours), on the three subject-verb agreement tasks from Marvin and Linzen (2018). Across all variants, our model consistently pulls attention away from distractor nouns.

Case Study: Analyzing attention patterns on subject-verb agreement tasks. We consider the 3 subject-verb agreement tasks (Marvin and Linzen, 2018) from the SG test suites. On these tasks, models are presented with a prefix consisting of a main subject and a distractor embedded subject, where these items conflict in number. The objective is to assign a higher log-probability to the verb that agrees with the main subject rather than the distractor subject. For instance, for the prefix The author that hurt the senators, the model must assign a higher probability to is than to are. From Fig. 3, we find that Pushdown-LM significantly outperforms other models with close to 80% accuracy, while Base-LM achieves less than 60% accuracy. To understand how Pushdown Layers modulate attention on these examples, we obtain attention scores over all prefix tokens (averaged across all layers). We present the average attention assigned to the distractor token for both Pushdown-LM and Base-LM in Fig. 5, where we observe that Pushdown-LM pulls attention away from the distractor noun, allowing it to predict the correct verb. Finally, we plot some (averaged) attention heatmaps in Fig. 6.

Other Related Work
While recursive structure is fundamental to natural language, modeling such structure is difficult for self-attention. Hahn (2020) considers DYCK, the simplest formal language with recursive structure, proving that hard attention cannot recognize DYCK and that soft attention cannot recognize DYCK with low cross-entropy. In practice, even simpler languages like PARITY are challenging for encoder-only Transformers (Chiang and Cholak, 2022; Bhattamishra et al., 2020). On the other hand, Transformers with decoders have been shown to be Turing-complete (Perez et al., 2021), but these constructions rely on the impractical assumption of running the decoder for an unbounded number of steps. In practice, Transformer LMs struggle to generalize beyond regular languages and tend to learn shortcuts instead (Deletang et al., 2023; Liu et al., 2023).
Given these limitations, there is significant interest in inductive biases that encourage recursive structure in Transformers. One line of work constrains self-attention patterns according to syntactic parses (Strubell et al., 2018; Wang et al., 2019b; Peng et al., 2019; Deshpande and Narasimhan, 2020, among others). A second line of work adds structure to language modeling through joint probabilistic modeling of structure and strings (Chelba, 1997; Mirowski and Vlachos, 2015; Choe and Charniak, 2016; Dyer et al., 2016, among others). Both ideas are combined in the recent work of Qian et al. (2021) and Sartran et al. (2022), who propose joint string-parse Transformer language models with constrained attention patterns. While Pushdown Layers are also in this modeling tradition, we do so without operating on long transition-action sequences, and structural biases are learnt via gradient descent rather than enforced as hard constraints.

Figure 6: Given a prefix containing a main noun and a distractor noun, Pushdown-LM pulls attention away from the distractor (here senator), helping the model predict the verb with the correct number. These attention maps are averaged across all instances in the number_src test of the SG test suites; we show attention over all prefix tokens when the main verb is predicted.
A separate line of work proposes neural networks augmented with structured memory such as stacks (Das et al., 1992; Grefenstette et al., 2015; Joulin and Mikolov, 2015; DuSell and Chiang, 2022) or random-access memories (Kurach et al., 2015). Such augmented neural networks are vastly better at algorithmic generalization and at learning recursive structure (Suzgun et al., 2019; Deletang et al., 2023). Our work is the first to design a structured memory (the stack tape) for Transformers that is updated in a shift/reduce manner like the stacks of prior work, but whose specific design makes training parallelizable.
Finally, there have been several efforts to add syntactic inductive biases to sequence models (typically RNNs) that acquire and use parse structures in an unsupervised manner (Bowman et al., 2016; Shen et al., 2019; Drozdov et al., 2019; Kim et al., 2019, among others). We leave unsupervised training of Pushdown Transformers to future work.

Conclusion
We propose Pushdown Layers, a new kind of self-attention that augments Transformer language models with a stack-based memory. Pushdown Layers enable autoregressive Transformers to softly bias attention towards a recursive syntactic computation, through an updatable stack tape that stores token depths in an incremental parse. When trained on synthetic and natural languages, Transformer LMs with Pushdown Layers achieve better generalization to deep recursive structure, as well as better and more sample-efficient syntactic generalization. When pre-trained LMs are finetuned with Pushdown Layers, we obtain improvements on some GLUE tasks.

Limitations
Pushdown Layers require constituency-parse-annotated datasets, which may not be available for many languages due to a lack of high-performing off-the-shelf constituency parsers. This also limits applicability to domains beyond natural and synthetic languages, such as algorithmic reasoning. Finally, Pushdown Layers can only be applied to languages with constituency structure, and our experiments are limited to English.

Figure 1 :
Figure 1: (a) Pushdown Layers use a stack tape to featurize the contents of an explicit stack, in terms of estimated token depths, where the stack represents incremental parses. (b) These depths map onto depth embeddings (in blue) that are added to token keys before computing attention scores, softly biasing attention towards a recursive syntactic computation. (c) The stack is updated synchronously with the newly predicted word, via an attachment head that selects, via attention, a constituent to reduce the newly predicted word with.

Figure 2 :
Figure 2: Illustration of how the parse [[The dog] [is happy]] is built as a unique sequence of stack-tape updates in Pushdown LMs. Here, as the word happy is predicted, the attachment head chooses a constituent (bolded) from the current incremental parse, via attention. Attachment decisions are made to constituents by attending to their rightmost token; none of the other tokens of a constituent can be attended to (shown as dashed lines). These attachment decisions are used to update depth values in the tape.
As an example, consider the incremental parse [[The dog] [in [the park]]]. Here, the representation for in attends to the representations of The, dog and in with depths [1, 1, 0], while the representation for park attends to these same representations with updated depths [2, 2, 2].

Figure 4 :
Figure 4: Comparing a standard GPT-2 small architecture (Base-LM) with a model whose last 6 self-attention blocks use Pushdown Layers, trained on varying amounts of tokens from WIKITREES. We find that Pushdown Layers greatly improve the sample efficiency of syntactic generalization. For reference, we also include GPT2-small, which is trained on over 9 billion tokens.

Table 2 :
Syntactic generalization on BLIMP and SG test suites for Pushdown-LMs, standard Transformer LMs, and other syntactic LMs. All results for PLM-Mask are taken from Qian et al. (2021); results for PLM and TGs are taken from Sartran et al. (2022). * denotes differences that are not significant. PPL results marked with ‡ are taken from prior work and are not comparable due to differences in tokenization.

While Pushdown-LMs are comparable with Transformer Grammars (TG; Sartran et al., 2022) across all examples in the SG test suites (Table 2), they outperform TGs on 4 out of 6 tests, including the recursive center-embedding tests.
Table 1: Evaluating LMs at closing DYCK prefixes with longer dependencies (dep. length in brackets) and deeper structure. We find significant improvements from using Pushdown Layers over standard self-attention.

Figure 3: Comparing Pushdown-LM with prior LMs on BLIMP and SG test suites: we obtain p(x) by approximate marginalization via beam search. Since surprisal values − log p(x_t | x_<t) in the SG test suites are meant to reflect incremental sentence processing, we perform marginalization based on the beam state at time step t. We fix the beam size at 300.

Can Pushdown Layers continue to offer improvements on larger-scale language modeling? To test this, we construct WIKITREES, a dataset of over 100 million tokens extracted from Wikipedia articles (Merity et al., 2017).

Table 4 :
Unlabeled F1 of Pushdown-LM and Kitaev et al. (2019) against binarized ground-truth parses from the PTB and BLLIP test sets. We filter all examples from the PTB test set with unknown words, giving us 2335 out of 2416 sentences. Annotations on BLLIP-LG are obtained using Kitaev et al. (2019).