Structural Guidance for Transformer Language Models

Transformer-based language models pre-trained on large amounts of text data have proven remarkably successful at learning generic, transferable linguistic representations. Here we study whether structural guidance leads to more human-like systematic linguistic generalization in Transformer language models without resorting to pre-training on very large amounts of data. We explore two general ideas. The "Generative Parsing" idea jointly models the incremental parse and word sequence as part of the same sequence modeling task. The "Structural Scaffold" idea guides the language model's representation via an additional structure loss that separately predicts the incremental constituency parse. We train the proposed models, along with a vanilla Transformer language model baseline, on a 14 million-token and a 46 million-token subset of the BLLIP dataset, and evaluate the models' syntactic generalization performance on the SG Test Suites and a sampled subset of BLiMP. Experimental results across the two benchmarks suggest converging evidence that generative structural supervision can induce more robust and human-like linguistic generalization in Transformer language models without the need for data-intensive pre-training.


Introduction
Pre-trained Transformer architectures have led to huge progress in building more human-like language processing systems (Radford et al.; Devlin et al., 2019; Brown et al., 2020, among others). These models achieve impressive perplexity on language modelling datasets, perform well on grammatical judgments (Warstadt et al., 2020), and provide useful linguistic representations that benefit a wide range of downstream tasks. Probing analyses also suggest that these models learn to implicitly encode syntactic information (Hewitt and Manning, 2019; Clark et al., 2019) that may support better linguistic generalization than recurrent neural network (RNN) architectures.
However, the Transformer architecture (Vaswani et al., 2017) is an interesting subject of study beyond its success in transfer-learning settings. Transformer models lack the inductive biases of RNNs. Rather than maintaining a vector-valued state and updating it recurrently, auto-regressive Transformer models encode all past decisions simultaneously at each inference step, thanks to the self-attention mechanism. Their only notion of sequence order comes from position embeddings summed with content embeddings in both the input and the auto-regressive signals.
Previous work has shown the advantage of structural supervision in helping RNNs learn to maintain syntactic state and non-local dependencies (Kuncoro et al., 2018). It remains an open question whether Transformer language models can similarly benefit from generative structural supervision, and what form of structural supervision would most effectively induce human-like syntactic generalization. This work hypothesizes that Transformer language models may benefit from explicit generative structural supervision to systematically generalize syntactic knowledge. Here we explore two major classes of structural guidance for Transformer language models based on joint modeling of language and constituency parses. The "generative parsing as language modeling" approach builds a Transformer-parameterized model that learns to predict actions which incrementally build constituency trees along with terminal words, following prior work on RNNs (Dyer et al., 2016; Choe and Charniak, 2016). The "structural scaffolding" approach follows the general idea of regularizing hidden representations through a multi-task learning objective, which has seen prior success in various NLP tasks (Zhang and Weiss, 2016; Søgaard and Goldberg, 2016; Swayamdipta et al., 2018). We test these two approaches on two subsets of the BLLIP dataset (Charniak et al., 2000) and evaluate the models' syntactic generalization performance on the SG Test Suites (Hu et al., 2020) and a sampled subset of the BLiMP Benchmark (Warstadt et al., 2020). We show evidence that generative structural supervision indeed induces more robust and human-like linguistic generalization in Transformer language models, and we explore the different trade-offs involved in the presented methods.

Models
Here we explore joint modelling of structures and words parametrized with Transformers by considering both a sentence W and its constituency parse Y and modeling the joint distribution P (W, Y ).

Generative Parsing as Language Modeling
A language model can be described formally as a probability distribution over strings w_1, · · · , w_T of a language, usually factored left-to-right:

p(W) = \prod_{t=1}^{T} p(w_t \mid w_{<t}). (1)
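The left-to-right factorization can be made concrete in a few lines; the uniform toy conditional distribution below is purely illustrative, standing in for a trained model:

```python
import math

def sentence_log_prob(tokens, cond_log_prob):
    """Chain rule: log p(w_1..w_T) = sum_t log p(w_t | w_<t)."""
    return sum(cond_log_prob(w, tokens[:t]) for t, w in enumerate(tokens))

# Toy conditional: uniform over a 4-word vocabulary (an illustrative assumption).
uniform = lambda w, prefix: math.log(1 / 4)
print(sentence_log_prob(["the", "cat", "sat"], uniform))  # 3 * log(1/4) ≈ -4.159
```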
There are many possible approaches to combining language modeling and syntax modeling tasks. As long as the two tasks share some parameters, they can be considered a case of multi-task learning (Caruana, 1997). Of interest here are the models proposed in Recurrent Neural Network Grammars (RNNGs; Dyer et al., 2016) and parsing as language modeling (LSTM-LM; Choe and Charniak, 2016). Both approaches model the joint distribution of words W and constituency tree components Y as

p(W, Y) = \prod_{t=1}^{R} p(a_t \mid a_{<t}), (2)

where a_1, · · · , a_R are the transitions of a state machine that generates both the sentence and the tree. These transitions are similar to the well-established transition sets used for transition-based parsing (Earley, 1970), but adapted to generate both text and parse simultaneously. For the remainder of this work, we consider each a_t to be integer-valued, indexing a dictionary of transitions. A transition a can be a word w or an action that generates a component of the constituency tree y. The actions include non-terminal symbols that open and label a new constituent with label x, indicated as NT(x), and a REDUCE action closing the closest open constituent. An example of a partial parse tree and its transitions can be found at the top of Figure 1. RNNG and LSTM-LM parametrize the factorization in Equation 2 in different ways. RNNG utilizes stack-LSTMs, which allow it to dynamically create representations for partial tree components by composition. The LSTM-LM, by contrast, uses a flat parametrization, treating the transitions as a sequence in a conventional language model learnt with an LSTM (Hochreiter and Schmidhuber, 1997). It should also be noted that the LSTM-LM is designed as a parser, while the RNNG is also used as a language model. To derive a language model from a joint model, it is necessary to marginalize over all possible parse trees,

p(W) = \sum_{Y} p(W, Y), (3)

which is intractable since there is an exponentially large number of possible trees.
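To make the transition inventory concrete, the following sketch linearizes a toy constituency tree into the NT(x)/word/REDUCE sequence described above. The tuple-based tree encoding and function name are assumptions for illustration, not the paper's oracle code:

```python
def linearize(tree):
    """Convert a nested constituency tree into the transition sequence
    NT(label), word, ..., REDUCE used by generative transition-based parsing."""
    actions = []
    def visit(node):
        if isinstance(node, str):           # terminal word
            actions.append(node)
        else:                               # (label, child_1, ..., child_k)
            label, *children = node
            actions.append(f"NT({label})")  # open and label a constituent
            for child in children:
                visit(child)
            actions.append("REDUCE")        # close the nearest open constituent
    visit(tree)
    return actions

tree = ("S", ("NP", "the", "dog"), ("VP", "barks"))
print(linearize(tree))
# ['NT(S)', 'NT(NP)', 'the', 'dog', 'REDUCE', 'NT(VP)', 'barks', 'REDUCE', 'REDUCE']
```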
The original RNNG work (Dyer et al., 2016) proposes an approximate solution based on importance sampling. In this work we use the word-synchronous beam search approximation introduced in Stern et al. (2017).
The marginalized likelihood language model in Equation 3 is desirable because it makes no statistical independence assumption between language and syntax and shares all parameters across both tasks, with the exception of action-specific embeddings. Particularly relevant for this work is the fact that both word and non-word transitions are predicted indiscriminately as language model output and are available at each prediction step through the history a_{<t}.
In this work we propose to parametrize Eq. 2 with a Transformer language model (Vaswani et al., 2017). This is equivalent to the flat parametrization of the LSTM-LM, but using a Transformer language model instead. Unlike the LSTM-LM, which is a parsing model, we derive from it a language model by marginalization, as in the RNNG. A Transformer language model can be succinctly described as a neural network of vertically stacked layers, where the m-th layer is given by

h^m_{<t} = \mathrm{FF}^m\left(O \left[A^m_1(h^{m-1}_{<t}); \cdots; A^m_N(h^{m-1}_{<t})\right]\right). (4)

Here h^{m-1}_{<t} ∈ R^{H×t} is the output of the previous decoder layer for all previous predictions of the model at time step t, and H is the size of the hidden vector. The input to the first layer, i.e. h^0_{<t}, is the embeddings of all previous transitions a_{<t} concatenated with a start symbol. Each embedding is the sum of a content embedding, the dictionary vector being indexed, and a position embedding that encodes the absolute or relative position of each action in the sequence.
FF^m(·) is a feed-forward layer, A^m_1(·), · · · , A^m_N(·) are multiple self-attention heads, and O ∈ R^{H×H} is a matrix multiplication performed on the concatenated output of the attention heads. Both the feed-forward layer and the projection of the N attention heads through O are wrapped with residual, dropout, and layer normalization operations, which are omitted here for clarity.
Each attention head comprises a simple inner-product attention mechanism,

A^m_n(h^{m-1}_{<t}) = (V^m_n h^{m-1}_{<t}) \cdot \mathrm{softmax}\left((K^m_n h^{m-1}_{<t})^\top (Q^m_n h^{m-1}_{<t}) + M\right), (5)

where V^m_n, K^m_n, Q^m_n ∈ R^{H/N×H} are the value, key, and query projection matrices respectively, and the softmax operation is normalized over columns to sum to one. The matrix M ∈ {−∞, 0}^{t×t} prevents the model from attending to future states during training, enabling efficient parallelization. It is displayed here due to its relevance for the next section.
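A minimal NumPy sketch of this masked, column-normalized attention head follows; the 1/√d scaling factor and the dimensions are illustrative choices, not the paper's exact configuration:

```python
import numpy as np

def causal_mask(t):
    """M[i, j] = 0 where key position i <= query position j, -inf otherwise,
    so column j can only attend to past and current positions."""
    return np.where(np.triu(np.ones((t, t))) > 0, 0.0, -np.inf)

def attention_head(h, Wq, Wk, Wv, mask):
    """One head over a matrix h in R^{H x t} whose columns are time steps;
    the softmax normalizes over columns, as in the text."""
    q, k, v = Wq @ h, Wk @ h, Wv @ h
    scores = (k.T @ q) / np.sqrt(q.shape[0]) + mask  # entry (i, j) = key_i . query_j
    weights = np.exp(scores - scores.max(axis=0))
    weights /= weights.sum(axis=0)                   # each column sums to one
    return v @ weights

rng = np.random.default_rng(0)
H, t = 8, 4
h = rng.normal(size=(H, t))
Wq, Wk, Wv = (rng.normal(size=(H, H)) for _ in range(3))
out = attention_head(h, Wq, Wk, Wv, causal_mask(t))
print(out.shape)  # (8, 4)
```

Because of the mask, the first column of the output depends only on the first input position.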
Similarly to other models, to derive a distribution over all possible transitions, including words, non-terminal symbols, and the REDUCE operation, we can use a softmax together with an inner product,

p(a_t \mid a_{<t}) = \mathrm{softmax}(E_{W \cup Y} \cdot h^m_t)_{a_t}, (6)

where E_{W∪Y} are the embeddings for the joint vocabulary of words, non-terminals, and REDUCE transitions. Henceforth, we refer to this model as Parsing as Language Model, or PLM for short. Unlike LSTMs or the RNNG, the Transformer has direct access to all past decisions through self-attention and relies on position embeddings to encode word order. Thus, in principle, there is no structural bias for the model to favor past decisions that are close in time when informing the current prediction. On one hand, this potential ability to use long-distance information can enable a less local, more human-like processing of language; on the other hand, it can also result in an additional learning burden, especially if there is not sufficient training data available. Also worth noting for the experiments proposed here is that the total number of parameters of a typical Transformer greatly exceeds that of an LSTM or an RNNG model.

Incorporating RNNG-like characteristics
As previously mentioned, unlike any of the other models, the RNNG is able to create partial tree representations by composition using stack-LSTMs. This changes the RNNG model structure dynamically as a function of the partial parse, a very desirable property for deriving syntax-aware representations. Moreover, the fact that recurrent neural networks such as LSTMs summarize all information about previous time steps in two hidden vectors creates a bottleneck that forces the model to focus on the local state. This is a situation where a syntax-aware representation can provide additional value by enabling the local state to better encompass past structures. We conjecture that a similarly constrained local state might benefit Transformer models in learning linguistic regularities, especially in a limited training data scenario.
In an attempt to capture a similar effect in the Transformer, we explore here the idea of masking some attention heads to reflect the parser state, as in the stack-Transformer (Astudillo et al., 2020). In the stack-Transformer, two attention heads are specialized to attend only to the contents of the buffer and the stack, respectively, for dependency and semantic parsing tasks. Here we likewise specialize two heads for each layer in Equation 4, as depicted in Fig. 2. One attention head attends to the contents of the last open constituent, whereas another head attends to all other past decisions not involving that constituent. The rest of the heads are left free, as in the original Transformer architecture. To constrain the attention heads, we only need to alter the mask M in Equation 5 to depend on the head index n and past actions, M_n(a_{<t}), which results in negligible computation overhead.
This hard masking makes the model structure change dynamically depending on the partial parse, and it forces some heads to focus on the local syntactic state. Nevertheless, unlike the RNNG, it does not create new representations of partial parses that can be composed in a recurrent manner at each time step, and some attention heads can still operate unrestricted. We hypothesize that a structure-aware attention mechanism may still help the model achieve better generalization: the symbolic representation induces a strong inductive bias toward how the model should use the structure that it generates on the fly. We henceforth refer to this model as PLM-mask.
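As a rough illustration, the sketch below derives, for each step, the attention-mask row of a head restricted to the most recent open constituent; the helper name and the exact timing of the mask update are assumptions, not the paper's implementation:

```python
def open_constituent_mask(actions):
    """For each step i, a boolean row marking the positions belonging to the
    innermost open constituent; a second head would receive the complement."""
    masks = []
    stack = []  # indices where currently open constituents started
    for i, a in enumerate(actions):
        if a.startswith("NT("):
            stack.append(i)             # open a new constituent here
        elif a == "REDUCE" and stack:
            stack.pop()                 # close the innermost constituent
        start = stack[-1] if stack else 0
        masks.append([start <= j <= i for j in range(len(actions))])
    return masks

actions = ["NT(S)", "NT(NP)", "the", "dog", "REDUCE", "barks"]
m = open_constituent_mask(actions)
print(m[3])  # at "dog", the head sees NT(NP), "the", "dog" only
# [False, True, True, True, False, False]
```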

Scaffolding by Learning to Predict Local Parse States
Given the strong coupling between the tasks, the marginal likelihood Transformer language model of the previous section can be expected to be strongly influenced by the additional syntax prediction task. This comes, however, at a significant cost. First, sequences combine words with non-terminal and REDUCE transitions, yielding longer sequences (R > T) than those of a normal language model. Furthermore, the approximate marginalization is computationally intensive and introduces an approximation error. One well-established regime that allows joint modeling of tasks at low complexity is the syntactic scaffold (Zhang and Weiss, 2016; Søgaard and Goldberg, 2016; Swayamdipta et al., 2018). Scaffolding adds an auxiliary structure prediction task at one of the layers of the model, as a separate layer and only during training. This is a minimally intrusive change, since it just branches some hidden vector of the network and computes an additional loss. It also has no influence on test-time runtime and avoids expensive steps such as marginalization.
However, applying the idea of syntactic scaffolding to our present scenario poses one difficulty. If we use a standard language model predicting words w and predict the non-word symbols y separately, we face the problem that the two sequences have different lengths. To overcome this in a straightforward way, we predict the n-gram of non-word actions y_{t:t+n(t)} corresponding to the partial parse synchronous with step t when we predict word w_t. We use a secondary softmax layer for this action n-gram prediction.
p(y_{t:t+n} \mid y_{<t}) = \mathrm{softmax}(E_{Y^*} \cdot h^m_t)_{y_{t:t+n}} (7)

Here E_{Y^*} are the embeddings for the vocabulary of all transition n-grams (excluding words) found in the training corpus, plus a blank symbol. Note that since scaffolding operates only at training time, we do not need to worry about these n-grams generalizing to test time.
The models are thus trained to minimize a joint loss, the sum of the standard language modeling loss and the scaffold loss from Equation 7. The scaffold can be set so that the synchronous non-word action n-grams y_{t:t+n(t)} are predicted either before (Figure 1c, left) or after (Figure 1c, right) producing w_t. We considered both variants in our experiments to empirically assess their impact on performance. We refer to this model as Transformer Language Model with Syntactic Scaffold, or ScLM for short, and to its two versions as ScLM-past and ScLM-next, for past and next n-gram prediction respectively.
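The alignment between words and their synchronous non-word action n-grams can be sketched as follows. The function name, the "_" stand-in for the blank symbol, and the treatment of trailing actions are illustrative assumptions rather than the paper's preprocessing code:

```python
def word_synchronous_ngrams(actions, before=True):
    """Pair each word with the n-gram of non-word actions immediately
    preceding it (ScLM-past) or following it (ScLM-next)."""
    pairs, buf, last_word = [], [], None
    for a in actions:
        if a.startswith("NT(") or a == "REDUCE":
            buf.append(a)               # accumulate non-word actions
        elif before:
            pairs.append((a, tuple(buf) or ("_",)))  # "_" = empty n-gram
            buf = []
        else:
            if last_word is not None:
                pairs.append((last_word, tuple(buf) or ("_",)))
            buf, last_word = [], a
    if not before and last_word is not None:
        pairs.append((last_word, tuple(buf) or ("_",)))
    return pairs

actions = ["NT(S)", "NT(NP)", "the", "dog", "REDUCE", "NT(VP)", "barks",
           "REDUCE", "REDUCE"]
print(word_synchronous_ngrams(actions, before=True))
# [('the', ('NT(S)', 'NT(NP)')), ('dog', ('_',)), ('barks', ('REDUCE', 'NT(VP)'))]
```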

Model Training
All models, including the baseline vanilla language models (LM for short), the syntactic scaffold models, and the generative parsing models, are based on the same architecture as GPT-2 small (Radford et al.) (117M parameters, 12 layers, H = 768) and use the same BPE tokenizer, but with randomly initialized weights. We believe this also gives a fair comparison against pre-trained GPT-2, in order to evaluate whether structural guidance helps improve sample efficiency. We implemented all the proposed models using Hugging Face's Transformers package (Wolf et al., 2020). As our goal here is to study whether structural guidance helps models learn robust, human-like generalization of syntactic knowledge, we train our models on the BLLIP dataset (Charniak et al., 2000), an English newswire-style corpus used in Hu et al. (2020). This makes our results more comparable to those reported in previous work, especially with RNNGs. We train the proposed models and the baseline vanilla Transformer language models on BLLIP-MD, a 14 million-token corpus, and BLLIP-LG, a 46 million-token corpus, both of which are auto-parsed using a state-of-the-art constituency parser (Kitaev and Klein, 2018). We used the parsed sentences to generate oracle parsing action sequences for PLM and PLM-mask. We collected a list of word-synchronous parsing action sequences from the train and development oracles of BLLIP-LG and used it to parametrize the action n-gram vocabulary of the ScLMs trained on both BLLIP-MD and BLLIP-LG. There are 3756 action n-gram types from the corpora, including one padding token and one blank token.
All models were trained with a learning rate of 10^{-5}, the AdamW optimizer, and minibatches of size 5. We trained multiple seeds of each model within the capacity of our resources, in order to account for potential variance. In total, there are three seeds of LM, four of ScLM-past, four of ScLM-next, three of PLM, and three of PLM-mask for BLLIP-MD, and the same number of seeds of each model type for BLLIP-LG. Models were trained until convergence, as indicated by the development set loss during training.

Targeted Syntactic Evaluation
To assess whether a trained model systematically generalizes its syntactic knowledge, we employ the targeted syntactic evaluation paradigm (Marvin and Linzen, 2018). Specifically, we measure models' performance on two held-out test datasets: a collection of syntactic generalization test suites from Hu et al. (2020) and the BLiMP Benchmark from Warstadt et al. (2020). These two datasets cover a wide range of English syntactic phenomena.
Tests from Hu et al. (2020), which we refer to as the SG Test Suites, consist of hand-designed test suites for evaluating fine-grained syntactic generalization in incremental processing of a linguistic input. The general method is to compare models' surprisals, −log p(continuation | prefix), for grammatical and ungrammatical continuations given certain sentence prefixes. We report the accuracy averaged across the SG test suites. The BLiMP Benchmark features minimal pairs of a grammatical sentence W and an ungrammatical counterpart W*. To evaluate a model on these minimal pairs, one simply compares the likelihoods of W and W* assigned by the model.
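Scoring a minimal pair thus reduces to a likelihood comparison. A minimal sketch, where the toy length-based scorer is purely a stand-in for a trained language model:

```python
def blimp_accuracy(pairs, log_prob):
    """Fraction of minimal pairs where the grammatical sentence W receives a
    higher log-probability than its ungrammatical twin W*."""
    correct = sum(log_prob(good) > log_prob(bad) for good, bad in pairs)
    return correct / len(pairs)

# Toy scorer (an assumption for illustration): shorter sentences score higher.
toy_log_prob = lambda s: -len(s.split())
pairs = [("the dog barks", "the dog bark loudly"),
         ("she is here", "she are definitely here")]
print(blimp_accuracy(pairs, toy_log_prob))  # 1.0
```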
As is implied by the evaluation methods, we need to marginalize out the structure variables for the PLM and PLM-mask models in order to estimate the surprisal of a continuation given a sentence prefix, or the likelihood of a complete sentence. We follow a similar setup to previous work, applying word-synchronous beam search (Stern et al., 2017) to find a list Y_k of k incremental parses given a sentence prefix w_{<t}. We then sum the joint probability p(w_{<t}, y_{<t}) over the list of incremental parses given by the model to approximate the likelihood p(w_{<t}). We set the parse beam size to 100, the word-synchronous beam size k to 10, and the fast-track size to 5. Since the search process can be computationally intensive, the large number of items in the BLiMP Benchmark poses a computational challenge. We therefore select the first 100 of the 1000 items (10%) in each of the 67 tests of the BLiMP Benchmark. We report the accuracy over these 100 items and refer to this down-sized BLiMP Benchmark as BLiMP-10%.
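The summation over the surviving parses is a standard log-sum-exp over joint log-probabilities; a minimal sketch, independent of any particular beam-search implementation:

```python
import math

def marginal_log_prob(beam_joint_log_probs):
    """Approximate log p(w_<t) by summing joint probabilities p(w_<t, y) over
    the k parses kept by word-synchronous beam search (stable log-sum-exp)."""
    m = max(beam_joint_log_probs)
    return m + math.log(sum(math.exp(lp - m) for lp in beam_joint_log_probs))

# Two surviving parses; the marginal is their sum in probability space.
print(marginal_log_prob([math.log(0.03), math.log(0.01)]))  # ≈ log(0.04) ≈ -3.2189
```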
We compare models' performance on the SG Test Suites and BLiMP-10% in Figure 3. Each bar shows a model's performance averaged across multiple seeds on a given benchmark, with each dot plotting the accuracy of a specific seed. Overall, syntactic generalization performance improves as the training data size increases from BLLIP-MD (14 million tokens) to BLLIP-LG (46 million tokens). Models with structural guidance achieve higher accuracy than the vanilla Transformer language model trained on the same raw text data without explicit structural information. We also include the results for the RNNGs taken from Hu et al. (2020); the RNNG lags behind all Transformer models by a large margin in average scores. We also notice that among the different forms of structural guidance, generative parsing as language modeling is the most effective at improving syntactic generalization over the baseline Transformer language models. We did not observe consistent benefits from adding the dynamic masking mechanism to PLM. While the scaffolding approach slightly improves on the vanilla Transformer language models, it still falls behind the best performance of the models trained with generative parsing. We hypothesize that our scaffold does not fully exploit the compositional structure of the local parses, since it models each action n-gram as a distinct type, whereas the generative parsing models predict actions from a relatively small set of non-terminal actions, which might make it easier for PLM and PLM-mask to learn compositional generalization. We leave it for future work to design new scaffolds that take advantage of the combinatorial nature of syntactic structure.
For completeness, we also ran the pre-trained GPT-2 model on the syntactic suites. This yielded a score of 0.808 on the SG Test Suites and 0.827 on BLiMP-10% for the small version of pre-trained GPT-2. Among models trained on BLLIP-LG, the average accuracy on the SG Test Suites is 0.723 for PLMs, 0.748 for PLM-masks, and 0.665 for LMs. A similar trend is observed on BLiMP-10%, where among models trained on BLLIP-LG the average accuracy is 0.751 for PLMs, 0.753 for PLM-masks, and 0.708 for LMs. The proposed PLM method closes about half of the gap between GPT-2 small and the same model trained on BLLIP-LG, while the improvement on BLiMP is more modest but still significant. It remains an open question whether scaling syntactic supervision to a larger dataset than BLLIP-LG would bring the generalization performance of PLM models closer to that of the pre-trained GPT-2 model.

Relationship between Perplexity and Syntactic Generalization Performance
We compare perplexity on the BLLIP held-out test set against syntactic generalization performance in Figure 4. Perplexities of the PLM and PLM-mask models are computed by setting the parse tree equal to the gold parse in Equation 3. All our models use the same BPE vocabulary and word tokenization as GPT-2; the only exception is the additional parsing actions in the vocabulary Y. From Figure 4, both perplexity and syntactic generalization performance improve with dataset size. However, for both training dataset sizes, we see that structural guidance can improve syntactic generalization: PLM models consistently perform better than vanilla models. While all models achieve very similar perplexity after being trained on a given dataset, their syntactic generalization performance differs dramatically.
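For reference, perplexity here is the exponentiated negative mean token log-probability; a minimal sketch (restricting the PLM's gold-parse joint scores to word positions is a simplifying assumption in this illustration):

```python
import math

def perplexity(token_log_probs):
    """PPL = exp(-1/T * sum_t log p(w_t | w_<t)) over the word positions."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Four tokens, each with probability 1/4, give a perplexity of 4.
print(perplexity([math.log(0.25)] * 4))  # ≈ 4.0
```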

Effect of Structural Guidance on Learning Specific Syntactic Structures
In addition to comparing the models' aggregate performance, we also compare their generalization performance on clustered subsets of tests in the SG Test Suites and BLiMP-10%. These subsets consist of several related tests that target a specific type of syntactic phenomenon, such as NPI licensing, subject-verb agreement, filler-gap dependencies, etc. We also include the results for the RNNGs taken from Hu et al. (2020). Results in Figure 5 show converging evidence that structural guidance in the form of generative parsing can robustly improve learning of subject-verb agreement and NPI licensing, and helps the model better capture incremental processing phenomena such as garden-path effects, but seems to slightly hurt performance on gross syntactic state. While the RNNG shows poor overall performance, this is mostly due to its very low scores on the licensing suites. Excluding these suites, the RNNG shows performance close to the PLM model, even clearly outperforming it on the gross syntactic state suites. In this category and in binding, the PLM variants seem inferior to all the other models.

Related Work
Multi-task learning (Caruana, 1997) has been applied to a variety of NLP tasks, with traditional modeling approaches (Miller et al., 2000; Sutton and McCallum, 2005; Sutton et al., 2007) as well as more recent neural models (Collobert et al., 2011; Li et al., 2020a). A recurring theme has been the use of structure, in the form of syntactic trees, to benefit other NLP tasks. Among the early works exploring this direction, Punyakanok et al. (2008) showed that syntactic parses can benefit Semantic Role Labeling (SRL). Poon and Domingos (2009) extended this idea to induce first-order logic representations in an unsupervised fashion, by clustering the dependency structures. In both cases syntax forms part of a pipeline and is not strictly supervision for the end task.
This trend continued with the rise of neural models. Collobert et al. (2011) improved deep convolutional neural network models for syntactic chunking with additional POS supervision. Zhang and Weiss (2016) and Søgaard and Goldberg (2016) showed the benefit of jointly supervising syntactic tasks at different network layers. More recent work incorporates a syntactic graph recurrent neural network into BERT models for better semantic role labeling, although that method shows little or no benefit of syntax modeling for Named Entity Recognition and relation linking tasks. Neural machine translation (Chen et al., 2018) and text generation (Li et al., 2020a) have also been shown to benefit from syntactic modeling, and Li et al. (2020b) report related results on several text classification benchmarks. Other works have found that structural supervision in the form of intermediate fine-tuning (e.g., on CCG supertagging) is not helpful or even harmful (Pruksachatkun et al., 2020; Warstadt et al., 2019). The focus of our work is on gauging the impact of joint modeling on syntactic generalization performance. In this direction, the work of Swayamdipta et al. (2018) is closest to the scaffolding version of our model. They predict multiple labels extracted from syntactic information as auxiliary tasks and show positive effects on shallow semantic parsing and co-reference resolution. We use, however, a single feature, the constituency-parse action n-gram, which is closer to prior work relying on part-of-speech information. In addition, we explore the impact of using preceding vs. subsequent structure as the feature, which, as shown, plays a role in the learning process.
In terms of modeling objective and syntactic representations, our method is closest to the works of Choe and Charniak (2016) and Dyer et al. (2016) that jointly model syntax and language. A more recent work uses a Rational Neural Network language model that can derive binary unlabeled constituents from attention weights and can supervise the attention to attain a structural inductive bias; the proposed models show lower language modeling perplexity compared to their structure-agnostic counterparts. We extend here the idea of syntax-aware language modeling to Transformer-based language models.
Finally, our approach relates to other works that propose ways of incorporating structural information into Transformer-based models. These include using dependency or tree structure to constrain self-attention patterns (Strubell et al., 2018; Wang et al., 2019), guiding cross-attention (Chen et al., 2018; Astudillo et al., 2020), modelling syntactic distance (Du et al., 2020), using syntactic information to guide the computation flow in the model (Shen et al., 2021), or distilling knowledge from syntactic models (Kuncoro et al., 2020). Our structured masking in the parsing-as-language-modeling approach is close in spirit to methods that modify the attention mechanism according to syntactic connections (Astudillo et al., 2020). This work, however, primarily aims to study the impact of structural guidance on syntactic generalization; we therefore resort to simpler methods of incorporating structure, to minimize the impact of modeling intricacies.

Conclusion
Our work explores two forms of syntactic supervision as structural guidance for Transformer language models. Experiments suggest that the generative parsing approach can effectively improve systematic generalization of learned syntactic knowledge in the small training data regime, while a naive syntactic scaffold approach does not improve the baseline to the same extent despite its reduced computational cost at inference time. Future work may explore alternative structural guidance strategies that combine the best of both approaches.