Do Transformers Parse while Predicting the Masked Word?

Pre-trained language models have been shown to encode linguistic structures, e.g. dependency and constituency parse trees, in their embeddings while being trained on unsupervised loss functions like masked language modeling. Doubts have been raised about whether the models are actually parsing or only performing some computation weakly correlated with it. We study the questions: (a) Is it possible to explicitly describe transformers with realistic embedding dimension, number of heads, etc. that are capable of doing parsing -- or even approximate parsing? (b) Why do pre-trained models capture parsing structure? This paper takes a step toward answering these questions in the context of generative modeling with PCFGs. We show that masked language models like BERT or RoBERTa of moderate sizes can approximately execute the Inside-Outside algorithm for the English PCFG [Marcus et al., 1993]. We also show that the Inside-Outside algorithm is optimal for the masked language modeling loss on PCFG-generated data. Furthermore, we give a construction of transformers with $50$ layers, $15$ attention heads, and $1275$-dimensional embeddings on average such that constituency parsing with $>70\%$ F1 score on the PTB dataset is possible using their embeddings. Finally, probing experiments on models pre-trained on PCFG-generated data show that the embeddings not only allow recovery of approximate parse trees, but also recover the marginal span probabilities computed by the Inside-Outside algorithm, suggesting an implicit bias of masked language modeling towards this algorithm.


Introduction
One of the surprising discoveries about transformer-based language models like BERT [Devlin et al., 2019] and RoBERTa [Liu et al., 2019] was that contextual word embeddings encode information about parsing, which can be extracted using a simple "linear probe" to yield approximately correct dependency parse trees for the text [Hewitt and Manning, 2019, Manning et al., 2020]. Subsequently, Vilares et al. [2020], Wu et al. [2020], and Arps et al. [2022] used linear probing to also recover information about constituency parse trees. The current paper focuses on the ability of BERT-style transformers to do constituency parsing, specifically for Probabilistic Context-Free Grammars (PCFGs). As noted in [Bhattamishra et al., 2020, Pérez et al., 2021], transformers are Turing complete and thus certainly capable of parsing. But do they parse while trying to do masked-word prediction?
One reason to be cautiously skeptical is that a naive translation of constituency parsing algorithms into a transformer results in a number of attention heads that scales with the size of the grammar (Section 3.1), whereas BERT-like models have around a dozen heads. This leads to the following question.
(Qs 1): Are BERT-like models capable of parsing with a realistic number of heads? This is not an idle question: recently, Maudslay and Cotterell [2021] suggested that linear probing was using semantic cues to do parsing. They constructed semantically meaningless but syntactically correct sentences and observed a large drop in parsing performance via linear probes compared to earlier papers.
(Qs 2): Do BERT-like models trained for masked language modeling (MLM) encode syntax, and if so, how and why?

This paper
To understand Qs 1, we first construct an attention model that executes the Inside-Outside algorithm for probabilistic context-free grammars (PCFGs) (Section 3.1). If the PCFG has N non-terminals and the length of the sentence is L, our constructed attention model has 2L layers in total, N attention heads in each layer, and 2NL embedding dimensions in each layer. However, this is massive compared to BERT. For the PCFG learned on the Penn Treebank (PTB) [Marcus et al., 1993], N = 1600 and the average L ≈ 25, which leads to an attention model with 80k embedding dimensions, depth 50, and 1.6k attention heads per layer. By contrast, BERT has 768 embedding dimensions, 12 layers, and 12 attention heads per layer! One potential explanation could be that BERT does not do exact parsing; it merely computes some of the information relevant to parsing. After all, linear probing didn't recover complete parse trees; it recovered trees with modest F1 scores, such as 78.2% for BERT [Vilares et al., 2020] and 82.6% for RoBERTa [Arps et al., 2022]. To the best of our knowledge, there has been no study of parsing methods that strategically discard information to do approximate parsing in a more resource-efficient manner. Toward this goal, we design an approximate version of the Inside-Outside algorithm (Section 3.3) that is executable by an attention model with 2L layers, 15 attention heads, and 40L embedding dimensions, while still achieving a > 70% F1 score for constituency parsing on the PTB dataset [Marcus et al., 1993].
The above construction shows that realistic architectures are capable of capturing a fair bit of parsing information. But this still begs the question of whether or not they need to do so for masked language modeling. After all, Maudslay and Cotterell [2021] suggested that linear probing of MLMs is picking up on semantic information that simply happens to correlate with parse trees. To better understand this, we try to take semantics out of the picture as follows: generate synthetic text according to a PCFG that was fitted to English text, and then train a (masked) language model on the synthetic text. This is a more rigorous way of separating syntax from semantics than the more ad hoc method of Maudslay and Cotterell [2021]. Section 3.2 notes that given such synthetic text, the Inside-Outside algorithm minimizes the MLM loss. Note that parsing could in principle be done by other algorithms like CYK [Kasami, 1966], but there is no explicit connection between CYK and masked language modeling. Our experiments show that language models pre-trained on the synthetic data still contain syntactic information: simple probing methods recover reasonable parse-tree structure (Section 4.2), though interestingly, the quality is better when probing with a 2-layer net instead of a linear probe. More interestingly, probes of the contextualized embeddings show (Section 4.3) that they contain information correlated with the quantities computed in the Inside-Outside algorithm. This suggests that attention models do indeed implicitly perform some form of approximate parsing -- in particular, a process related to the Inside-Outside algorithm -- to achieve low MLM loss.

Attention
For the rest of the paper, we will focus on encoder-only models like BERT and RoBERTa [Devlin et al., 2019, Liu et al., 2019]. An encoder-only model stacks multiple identical layers, where each layer contains an attention module followed by a feed-forward module. We focus our interest on the attention module. Each attention module is comprised of multiple heads, where the computation of each head h is represented by three matrices: a key matrix K_h, a query matrix Q_h, and a value matrix V_h. Given an input sequence [w_1, ..., w_L] of length L, we will denote the contextual embeddings of the sequence after layer ℓ's computations by E^(ℓ) ∈ R^{L×d}, where e^(ℓ)_i denotes the contextual embedding of the i-th token. The computation of an attention head h at layer ℓ is given by

v^(ℓ)_{i,h} = Σ_j a^h_{i,j} V_h e^(ℓ−1)_j,   (1)

where a^h_{i,j} is defined as the attention score between e_i and e_j under head h and is given by a^h_{i,j} = f_attn((K_h e^(ℓ−1)_j)ᵀ Q_h e^(ℓ−1)_i); f_attn is a non-linear function and is generally taken to be a softmax activation over E^(ℓ−1) K_hᵀ Q_h e^(ℓ−1)_i. Finally, the output of the attention module is given by Σ_h v^(ℓ)_{i,h}. Note that the above definition of attention captures the attention module used in practice, where the contextual embeddings are split equally among the different attention heads, and the outputs of the heads are simply merged after the individual head computations. For ease of presentation, we will use this more general definition henceforth.
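As a concrete reference, the head computation above can be sketched in a few lines of NumPy. This is an illustrative sketch under our own naming and shape conventions (square K, Q, V matrices; softmax as f_attn), not the exact implementation of any particular model:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(E, K, Q, V):
    """One head: v_i = sum_j a_{i,j} V e_j, with scores a_{i,j} obtained by
    applying f_attn (here, softmax over j) to (Q e_i).(K e_j).
    E: (L, d) embeddings from the previous layer; K, Q, V: (d, d)."""
    S = (E @ Q.T) @ (K @ E.T)   # S[i, j] = (Q e_i) . (K e_j)
    A = softmax(S)              # attention scores; each row sums to 1
    return A @ (E @ V.T)        # (L, d) head outputs

def attention_module(E, heads):
    # the module output sums the per-head outputs, as in the text
    return sum(attention_head(E, K, Q, V) for (K, Q, V) in heads)
```

With zero key and query matrices, every score is equal and each output becomes the uniform average of the value-transformed embeddings, which is a quick sanity check on the score/averaging logic.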

PCFG and parsing
PCFG model: A probabilistic context-free grammar (PCFG) is a formal generative model of language. It is defined as a 5-tuple G = (N, I, P, n, p), where
• N is the set of non-terminal symbols in the grammar. I ⊂ N is a finite set of in-terminals and P ⊂ N is a finite set of pre-terminals. We assume that N = I ∪ P and I ∩ P = ∅.
• [n] is the set of all possible words.
• For all A ∈ I, B ∈ N, C ∈ N, there is a context-free rule A → BC.
• For each rule A → BC where A ∈ I and B, C ∈ N, there is a probability Pr[A → BC]. The probabilities satisfy Σ_{B,C} Pr[A → BC] = 1 for all A ∈ I.
• For all A ∈ P, w ∈ [n], there is a context-free rule A → w.
• For each rule A → w where A ∈ P, w ∈ [n], there is a probability Pr[A → w]. The probabilities satisfy Σ_w Pr[A → w] = 1 for all A ∈ P.
• A non-terminal Root ∈ I.
Strings are generated from the PCFG as follows: we maintain a string s_t ∈ ([n] ∪ N)* at each step t. The initial string is s_1 = ROOT. At step t, if all characters in s_t belong to [n], the generation process ends and s_t is the resulting string. Otherwise, we pick a character A ∈ s_t with A ∈ N and replace it by the right-hand side of a rule A → BC (or A → w) sampled according to the rule probabilities.

Parse trees and parsing: For a sentence s = w_1 w_2 ... w_L of length L, a labeled parse tree represents the most probable list of derivations that lead to the generation of the sentence under the PCFG G. It is defined as a list of spans with non-terminals {(A, i, j)} that forms a tree over the spans {(i, j)}_{i,j ∈ [L]}. An unlabeled parse tree is a list of spans (without the non-terminals) that forms a tree. We call the tree binary if every node of the resulting tree has at most two children.
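The generation process above can be made concrete with a minimal sketch. The toy grammar here (rule names, probabilities, and dictionary format) is our own illustrative invention in Chomsky normal form, not the English PCFG used later in the paper:

```python
import random

# Toy grammar: binary rules A -> B C and lexical rules A -> w, with probabilities.
BINARY = {
    "ROOT": [(("NP", "VP"), 1.0)],
    "NP":   [(("Det", "N"), 1.0)],
    "VP":   [(("V", "NP"), 1.0)],
}
LEXICAL = {
    "Det": [("the", 0.6), ("a", 0.4)],
    "N":   [("dog", 0.5), ("cat", 0.5)],
    "V":   [("sees", 1.0)],
}

def sample(symbol="ROOT", rng=None):
    """Expand `symbol` top-down by repeatedly replacing a non-terminal with
    the right-hand side of a rule drawn with the rule's probability."""
    rng = rng or random.Random(0)
    if symbol in LEXICAL:                      # pre-terminal: emit a word
        words, probs = zip(*LEXICAL[symbol])
        return [rng.choices(words, weights=probs)[0]]
    rules, probs = zip(*BINARY[symbol])        # in-terminal: expand A -> B C
    b, c = rng.choices(rules, weights=probs)[0]
    return sample(b, rng) + sample(c, rng)
```

Every sentence drawn from this toy grammar has the shape Det N V Det N, which makes it easy to check the sampler by inspection.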
Given a sentence s and the PCFG model, one well-known procedure to find the (unlabeled) parse tree is the Labeled-Recall algorithm [Goodman, 1996], which finds the tree maximizing the sum of the marginal probabilities of its spans. We use the Inside-Outside algorithm [Baker, 1979, Manning and Schutze, 1999] to compute the marginal probabilities. The Inside-Outside algorithm uses dynamic programming to compute two probability terms, the inside probabilities α(A, i, j) = Pr[A → w_i w_{i+1} ... w_j | G] and the outside probabilities β(A, i, j) = Pr[Root → w_1 w_2 ... w_{i−1} A w_{j+1} ... w_L | G], for all non-terminals and all spans (i, j) with 1 ≤ i ≤ j ≤ L. Specifically, the recursive relations are given by

α(A, i, j) = Σ_{B,C ∈ N} Σ_{i ≤ k < j} Pr[A → BC] α(B, i, k) α(C, k+1, j),   (2)
β(A, i, j) = Σ_{B,C ∈ N} ( Σ_{k < i} Pr[B → CA] α(C, k, i−1) β(B, k, j) + Σ_{k > j} Pr[B → AC] α(C, j+1, k) β(B, i, k) ),   (3)

with the base cases α(A, i, i) = Pr[A → w_i] for all A, i, and β(Root, 1, L) = 1 with β(A, 1, L) = 0 for all A ≠ Root. The marginal probabilities are then given by µ(A, i, j) = α(A, i, j) β(A, i, j) / α(Root, 1, L). The performance of parsing is measured by the unlabeled F1 score, which is the F1 score on the prediction of spans. There are two different F1 scores depending on the averaging: Sentence F1, the average of the F1 scores of individual sentences, and Corpus F1, computed from the total true positives, false positives, and false negatives over the corpus.
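The inside half of this dynamic program (eq. 2 plus its base case) can be sketched as follows; the outside pass fills β analogously, working top-down from the full span. The toy grammar format is our own assumption for illustration:

```python
from collections import defaultdict

# Toy CNF grammar (illustrative): binary rules A -> B C and lexical rules A -> w.
BINARY = {"ROOT": [(("NP", "VP"), 1.0)], "NP": [(("Det", "N"), 1.0)],
          "VP": [(("V", "NP"), 1.0)]}
LEXICAL = {"Det": [("the", 0.6), ("a", 0.4)],
           "N": [("dog", 0.5), ("cat", 0.5)], "V": [("sees", 1.0)]}

def inside(words):
    """alpha[(A, i, j)] = Pr[A => w_i ... w_j], with 1-indexed spans."""
    L = len(words)
    alpha = defaultdict(float)
    for i, w in enumerate(words, start=1):     # base case: length-1 spans
        for A, rules in LEXICAL.items():
            for word, p in rules:
                if word == w:
                    alpha[(A, i, i)] += p
    for span in range(1, L):                   # spans of length span + 1
        for i in range(1, L - span + 1):
            j = i + span
            for A, rules in BINARY.items():
                for (B, C), p in rules:
                    for k in range(i, j):      # split point, as in eq. 2
                        alpha[(A, i, j)] += p * alpha[(B, i, k)] * alpha[(C, k + 1, j)]
    return alpha
```

Here `alpha[("ROOT", 1, L)]` is the total probability of the sentence; combining these values with the outside pass yields the marginals consumed by Labeled-Recall.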

Probing
A probe f(·) is just a supervised model: there is some input x to the probe, and we want f(x) to predict some target tar(x). For example, Hewitt and Manning [2019] used a probe f(·) to predict the tree distance tar(i, j) = d_T(i, j) between words, where the tree distance d_T(i, j) is the number of edges between words i and j in the dependency parse tree T. The input to f(·) is e^(ℓ)_i − e^(ℓ)_j, the difference between the layer-ℓ embeddings of the words at positions i and j, and f(e) = ||Be||², the square of a linear function, where B is the trainable parameter. Besides the tree distance, one can change the probe to predict other information (i.e., the target), change the input, or even change the function class of the probe f. Despite the mathematical equivalence of a probe and a supervised model (or a parser, in the parsing setting), their goals differ. The goal of a model (or a parser) is to get a high prediction score, while the goal of a probe is to identify the existence of certain information intrinsically stored in the embeddings [Maudslay et al., 2020, Chen et al., 2021]. Thus, we would like to restrict the power of probes so that they are "sensitive" to the information we want to probe. For example, in the parsing setting, ideally the probing performance should be low on un-contextualized embeddings (e.g., embeddings at the 0th layer of our models) and high on contextualized ones.
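For concreteness, the squared-linear distance probe of Hewitt and Manning can be sketched as below. This is a minimal sketch: in practice B would be trained by regressing the predicted distances onto the gold tree distances, and the shapes here are our own assumptions:

```python
import numpy as np

def distance_probe(B, E):
    """Predicted tree distance between words i and j: ||B (e_i - e_j)||^2.
    B: (r, d) trainable probe matrix; E: (L, d) embeddings for one sentence.
    Returns an (L, L) matrix of predicted squared distances."""
    diffs = E[:, None, :] - E[None, :, :]   # (L, L, d) pairwise differences
    proj = diffs @ B.T                      # project with the linear map B
    return (proj ** 2).sum(axis=-1)
```

By construction the predictions are symmetric, non-negative, and zero on the diagonal, matching the basic properties of a tree distance.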

Parsing using Attention Models
In this section, we construct attention models with a moderate number of layers and heads that can perform parsing and optimize the masked language modeling loss. In Section 3.1, we show that given any PCFG, there exist attention models that can execute the Inside-Outside algorithm for that PCFG on bounded-length sentences. Then in Section 3.2, we connect our construction with masked language modeling by showing that the Inside-Outside algorithm is optimal for masked language modeling on data generated from the PCFG model. Finally, in Section 3.3, we show that the size of these constructions can be made even smaller while maintaining most of their parsing performance.

Attention model can execute Inside-Outside algorithm
In this section, we give several constructions that show even moderate-sized attention models are expressive enough to represent the Inside-Outside algorithm.
We first give a construction (Theorem 3.1) that relies on hard attention, where only one of the attended positions will have a positive attention score. For this construction, we define f_attn such that the attention scores in eq. 1 are given by

a^h_{i,j} = ReLU((K_h e^(ℓ−1)_j)ᵀ Q_h e^(ℓ−1)_i).   (5)

This is similar to the softmax attention used in practice, with the softmax replaced by a ReLU activation.
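A sketch of this hard-attention score, with the softmax of the standard module swapped for a ReLU (shapes and names are our own conventions): unlike softmax, the ReLU lets most scores be exactly zero, so only the intended positions attend.

```python
import numpy as np

def relu_attention_scores(E, K, Q):
    """Hard-attention scores a_{i,j} = ReLU((Q e_i) . (K e_j)) as in eq. 5."""
    S = (E @ Q.T) @ (K @ E.T)
    return np.maximum(S, 0.0)
```

With identity key/query matrices and embeddings [1] and [-1], the cross scores are negative and get zeroed out, while each position keeps a positive score with itself.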
Theorem 3.1 (Hard attention).Given a PCFG G = (N , I, P, n, p), there exists an attention model with hard attention modules (5), embeddings of size (4|N | + 1)L, 2L − 1 layers, and 4|N | attention heads in each layer, that can simulate the Inside-Outside algorithm on all sentences of length at most L generated from G, and embed all the inside and outside probabilities.
Proof. The first L − 1 layers simulate the recursive formulation of the inside probabilities from eq. 2, and the last L − 1 layers simulate the recursive formulation of the outside probabilities from eq. 3. The model uses embeddings of size 4|N|L + L, where the last L coordinates serve as one-hot positional embeddings and are kept unchanged throughout the model.
Notations: For typographical simplicity, we will divide our embeddings into sub-parts. We will use the first 2|N|L coordinates to store the inside probabilities (|N|L each for spans starting at the position and spans ending at it), the second 2|N|L coordinates to store the outside probabilities (split the same way), and the final L coordinates to store the one-hot positional encodings. For every position i and span length ℓ + 1, we store the inside probabilities {α(A, i, i + ℓ)}_{A∈N} and {α(A, i − ℓ, i)}_{A∈N} after computation in the corresponding coordinate blocks of its embedding. For simplicity of presentation, we won't handle cases where i + ℓ or i − ℓ falls outside the range 1 to L; those coordinates are fixed to 0.
Token Embeddings: The initial embeddings for each token w will contain Pr[A → w] for all A ∈ P.This is to initiate the inside probabilities of all spans of length 1.Furthermore, the tokens will have a one-hot encoding of their positions in the input in the last L coordinates.
Inside probabilities: The contextual embedding at position i after the computations of any layer ℓ < L contains the inside probabilities of all spans of length at most ℓ + 1 starting or ending at position i, i.e., α(A, i, i + k) and α(A, i − k, i) for all A ∈ N and k ≤ ℓ. The rest of the coordinates, except the position coordinates, contain 0.
Layer ℓ (1 ≤ ℓ < L): At each position i, this layer computes the inside probabilities of spans of length ℓ + 1 starting or ending at i, using the recursive formulation from eq. 2.
For every non-terminal A ∈ N, we will use a unique attention head to compute α(A, i, i + ℓ) at each token i. Specifically, the attention head representing non-terminal A ∈ N will represent the following operation at each position i:

α(A, i, j) = Σ_{B,C ∈ N} Σ_{i ≤ k < j} Pr[A → BC] α(B, i, k) α(C, k+1, j),   (6)

where j = i + ℓ. We rewrite this formulation to represent the interaction of spans of different lengths starting at i and ending at j:

α(A, i, j) = Σ_{ℓ_1 + ℓ_2 = ℓ − 1} Σ_{B,C ∈ N} Pr[A → BC] α(B, i, i + ℓ_1) α(C, j − ℓ_2, j).   (7)

We represent this computation as the attention score a_{i,j} using a key matrix K^(ℓ)_A and a query matrix Q^(ℓ)_A.

Computing eq. 6: Let P_A ∈ R^{|N| × |N|} denote the matrix of rule probabilities Pr[A → BC]. If we set the block of (K^(ℓ)_A)ᵀ Q^(ℓ)_A that pairs the coordinates holding {α(C, j − ℓ_2, j)}_C in e_j with the coordinates holding {α(B, i, i + ℓ_1)}_B in e_i to P_A, for some 0 ≤ ℓ_1, ℓ_2 ≤ ℓ, with the rest set to 0, we get, for any two positions i, j,

(K^(ℓ)_A e^(ℓ−1)_j)ᵀ Q^(ℓ)_A e^(ℓ−1)_i = Σ_{B,C ∈ N} Pr[A → BC] α(B, i, i + ℓ_1) α(C, j − ℓ_2, j).

Because we want to involve the sum over all ℓ_1, ℓ_2 pairs with ℓ_1 + ℓ_2 = ℓ − 1, we will set the blocks at positions {(|N|(L + ℓ_2), |N|ℓ_1)}_{ℓ_1, ℓ_2 : ℓ_1 + ℓ_2 = ℓ − 1} to P_A, while setting the rest to 0. This gives us

(K^(ℓ)_A e^(ℓ−1)_j)ᵀ Q^(ℓ)_A e^(ℓ−1)_i = Σ_{ℓ_1 + ℓ_2 = ℓ − 1} Σ_{B,C ∈ N} Pr[A → BC] α(B, i, i + ℓ_1) α(C, j − ℓ_2, j),

which equals α(A, i, j) when j = i + ℓ. However, we want (K^(ℓ)_A e^(ℓ−1)_j)ᵀ Q^(ℓ)_A e^(ℓ−1)_i to compute α(A, i, j) iff j = i + ℓ and 0 otherwise, so we will use the final block in Q^(ℓ)_A, which acts on the one-hot position encodings of i and j, to differentiate the different location pairs. Specifically, the final block Q_p will return 0 if j = i + ℓ, while it returns −ζ for some large constant ζ if j ≠ i + ℓ. This gives us

(K^(ℓ)_A e^(ℓ−1)_j)ᵀ Q^(ℓ)_A e^(ℓ−1)_i = α(A, i, j) + ζ(I[j − i = ℓ] − 1).

With the inclusion of the term ζ(I[j − i = ℓ] − 1), the score is positive if j − i = ℓ and negative if j − i ≠ ℓ. Applying the ReLU activation on top will zero out the unnecessary terms, leaving us with α(A, i, i + ℓ) at each location i.
Similarly, we use another |N| attention heads to compute α(A, i − ℓ, i). In the end, we use the residual connections to copy the previously computed inside probabilities α(A, i − ℓ′, i) and α(A, i, i + ℓ′) for ℓ′ < ℓ.
Outside probabilities: In addition to all the inside probabilities, the contextual embedding at position i after the computations of layer (2L − 1) − ℓ (≥ L) contains the outside probabilities of all spans of length at least ℓ + 1 starting or ending at position i, i.e., β(A, i, i + k) and β(A, i − k, i) for all A ∈ N and k ≥ ℓ. The rest of the coordinates, except the position coordinates, contain 0.
Layer L: In this layer, we initialize the outside probabilities β(ROOT, 1, L) = 1 and β(A, 1, L) = 0 for A ≠ ROOT. Furthermore, we move the inside probabilities α(A, i + 1, i + k) from position i + 1 to position i, and α(A, i − k, i − 1) from position i − 1 to position i, using 2 attention heads.
At each position i, each subsequent layer computes the outside probabilities of spans of length ℓ + 1 starting or ending at i, using the recursive formulation from eq. 3. The recursive formulation for β(A, i, i + ℓ) for a non-terminal A ∈ N has two terms, β(A, i, j) = β_1(A, i, j) + β_2(A, i, j), given by

β_1(A, i, j) = Σ_{B,C ∈ N} Σ_{k < i} Pr[B → CA] α(C, k, i − 1) β(B, k, j),   (8)
β_2(A, i, j) = Σ_{B,C ∈ N} Σ_{k > j} Pr[B → AC] α(C, j + 1, k) β(B, i, k),   (9)

where j = i + ℓ. For each non-terminal A ∈ N, we will use two unique heads to compute β(A, i, i + ℓ), each representing one of the two terms in the above formulation. We outline the construction for β_1; the construction for β_2 follows similarly.
Computing eq. 8: We build the attention head in the same way we built the attention head representing the inside probabilities in eq. 7. Similarly to eq. 7, we rewrite the formulation of β_1 to highlight the interaction of spans of different lengths:

β_1(A, i, j) = Σ_{ℓ_1, ℓ_2 : ℓ_2 − ℓ_1 = ℓ} Σ_{B,C ∈ N} Pr[B → CA] α(C, i − ℓ_1, i − 1) β(B, j − ℓ_2, j),   (10)

where j = i + ℓ (so that j − ℓ_2 = i − ℓ_1). We represent this computation as the attention score a_{i,i+ℓ} using a key matrix K^(ℓ)_{A,1} and a query matrix Q^(ℓ)_{A,1}.

Intuition for Q^(ℓ)_{A,1}: As for the inside probabilities, we place the matrix of rule probabilities P̃_A (holding Pr[B → CA]) in the blocks of (K^(ℓ)_{A,1})ᵀ Q^(ℓ)_{A,1} that pair the coordinates holding {α(C, i − ℓ_1, i − 1)}_C with those holding {β(B, j − ℓ_2, j)}_B, with the rest set to 0, so that the score accumulates the corresponding products for any two positions i, j. Because we want to include the sum over ℓ_1, ℓ_2 pairs with ℓ_2 − ℓ_1 = ℓ, we will only set the blocks at positions corresponding to pairs 0 ≤ ℓ_1, ℓ_2 ≤ L that satisfy ℓ_2 − ℓ_1 = ℓ to P̃_A, while setting the rest to 0. Because we want the score to compute β_1(A, i, j) with j = i + ℓ and 0 otherwise, we will use the final block in Q^(ℓ)_{A,1}, which acts on the one-hot position encodings of i and j, to differentiate the different location pairs. Specifically, the final block Q_p will return 0 if j = i + ℓ, while it returns −ζ for some large constant ζ if j ≠ i + ℓ. This gives us

(K^(ℓ)_{A,1} e_j)ᵀ Q^(ℓ)_{A,1} e_i = β_1(A, i, j) + ζ(I[j − i = ℓ] − 1).

Applying a ReLU activation on top will zero out the unnecessary terms, leaving us with β_1(A, i, i + ℓ) at each location i.
Besides, we also need 2|N| additional heads for the outside probabilities β(A, i − ℓ, i). In the end, we use the residual connections to copy the previously computed outside probabilities β(A, i − ℓ′, i) and β(A, i, i + ℓ′) for ℓ′ > ℓ, as well as all the inside probabilities.
The use of hard attention simplifies the intuition behind the construction, but it doesn't fully leverage the power of attention models. Next, we show that it is possible to reduce the embedding size and the number of attention heads by introducing relative positions and using soft attention (where multiple attended positions can have a nonzero attention score). We introduce 2L + 1 relative position vectors {p_t ∈ R^d}_{−L ≤ t ≤ L} and relative position biases {b_{t,ℓ} ∈ R}_{−L ≤ t ≤ L, 1 ≤ ℓ ≤ 2L−1} that modify the key vectors depending on the relative position of the query and key tokens. For an attention head h in layer ℓ, the attention score a^h_{i,j} is given by

a^h_{i,j} = f_attn((K_h e^(ℓ−1)_j + p_{j−i})ᵀ Q_h e^(ℓ−1)_i + b_{j−i,ℓ}).   (11)

Theorem 3.2 (Relative positional embeddings). Given a PCFG G = (N, I, P, n, p), there exists an attention model with soft relative attention modules (11), with embeddings of size 2|N|L + 1, 2L layers, and |N| attention heads in each layer, that can simulate the Inside-Outside algorithm on all sentences of length at most L generated from G, and embed all the inside and outside probabilities.
The proof of the above theorem is in Appendix B.1. After executing the Inside-Outside algorithm to obtain the inside and outside probabilities for spans, one can directly build the parse tree by applying the Labelled-Recall algorithm [Goodman, 1996], which can be viewed as a sort of "probe".

Masked language modeling for PCFG
The Inside-Outside algorithm not only can perform parsing but also has a connection with masked language modeling, the pre-training loss used by BERT. The following theorem shows that, if the language is generated from a PCFG, then the Inside-Outside algorithm is optimal for predicting the masked tokens.
Theorem 3.3. Assuming that the language is generated from a PCFG, the Inside-Outside algorithm reaches the optimal masked language modeling loss.
Because the Inside-Outside algorithm is optimal for the masked language modeling loss on synthetic PCFG data, we conjecture that if the model is pre-trained on synthetic PCFG data, it will implicitly embed the Inside-Outside algorithm (or the quantities computed by it) in the model itself. This allows its intermediate layers to encode syntactic information useful for parsing. We verify this conjecture in Section 4.3. This conjecture may also explain why large language models pre-trained on natural language contain structural information [Hewitt and Manning, 2019, Vilares et al., 2020, Arps et al., 2022].

Proof of Theorem 3.3. We first focus on 1-mask predictions, where given an input of tokens w_1, w_2, ..., w_L and a randomly selected index i, we need to predict the token at position i given the rest of the tokens, i.e., Pr[w | w_{−i}]. Under the generative rules of the PCFG model, we have

Pr[w_i = w | w_{−i}] ∝ Σ_{A ∈ P} β(A, i, i) Pr[A → w].   (12)

Note that Pr[A → w] can be read off from the PCFG and {β(A, i, i)}_{A ∈ P} can be computed by the Inside-Outside algorithm. Thus, the Inside-Outside algorithm solves the 1-masking problem optimally. Now consider the case where we randomly mask m% (e.g., 15%) of the tokens and predict these tokens given the rest. In this setting, if the original sentence is generated from the PCFG G = (N, I, P, n, p), one can modify the PCFG to get G′ = (N, I, P, n + 1, p′), where word n + 1 denotes the mask token [MASK] and, for each pre-terminal A, Pr′[A → [MASK]] = m% while Pr′[A → w] = (1 − m%) Pr[A → w] for the original words w. Then the distribution of the randomly masked sentences follows the distribution of sentences generated from the modified PCFG G′. As in the 1-masking setting, we can use the Inside-Outside algorithm (on G′) to compute the optimal token distribution at a masked position.
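The 1-mask prediction step above turns the outside probabilities at the masked position into a distribution over words. A minimal sketch (the dictionary formats and function name are our own assumptions):

```python
def masked_word_distribution(beta_ii, lexical):
    """Optimal 1-mask prediction at position i, up to normalization:
    Pr[w_i = w | w_{-i}] is proportional to sum_A beta(A, i, i) * Pr[A -> w].
    beta_ii: {pre-terminal A: beta(A, i, i)} at the masked position i;
    lexical: {pre-terminal A: [(word w, Pr[A -> w]), ...]}."""
    scores = {}
    for A, b in beta_ii.items():
        for w, p in lexical.get(A, []):
            scores[w] = scores.get(w, 0.0) + b * p
    Z = sum(scores.values())                   # normalize to a distribution
    return {w: s / Z for w, s in scores.items()}
```

If only one pre-terminal has positive outside probability at the masked position, the prediction simply follows that pre-terminal's lexical rule probabilities, which is a quick sanity check.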

Towards realistic size
In Section 3.1, we showed the construction of an attention model that executes the Inside-Outside algorithm for any PCFG. However, for a PCFG learned on the PTB dataset, which contains sentences of average length 25, we need an attention model with 1600 attention heads, 3200L embedding dimensions, and 2L layers to simulate the Inside-Outside algorithm on sentences of length L. The constructed model is extremely large compared with BERT, which raises the question of whether our construction sheds any light on real-world architectures. In this section, we give positive evidence in this regard and show that we can utilize different approximation techniques to reduce the number of attention heads and the width of the embeddings in the constructed model, while still maintaining reasonable parsing performance. We heavily utilize the underlying sparsity in the structure of the English PCFG, which we point out below. The details are deferred to Appendix C.

Figure 1: Plot of the frequency distribution of in-terminals (I) and pre-terminals (P). We compute the number of times a specific non-terminal appears in a span of a parse tree in the PTB training set. We then sort the non-terminals by their normalized frequency and show the frequency vs. index plot.
First ingredient: finding important non-terminals. In the constructions of Theorems 3.1 and 3.2, the number of attention heads and embedding dimensions depends on the number of non-terminals of the PCFG. Thus if we can find a smaller PCFG, we can make the model much smaller. Specifically, if we only compute the probabilities of a specific set of in-terminals Ĩ and pre-terminals P̃ in eq. 2 and 3, we can reduce the number of attention heads from |N| to max{|Ĩ|, |P̃|}. Our hypothesis is that we can indeed focus on only a few non-terminals while retaining most of the performance.
Hypothesis 3.4. For the PCFG G = (N, I, P, n, p) learned on the English corpus, there exist Ĩ ⊂ I and P̃ ⊂ P with |Ĩ| ≪ |I| and |P̃| ≪ |P|, such that simulating the Inside-Outside algorithm with the non-terminals Ĩ ∪ P̃ introduces small error in the 1-mask perplexity and has minimal impact on the parsing performance of the Labeled-Recall algorithm.
We empirically verify our hypothesis through experiments. To find candidate sets Ĩ, P̃ for our hypothesis, we check the frequency with which different non-terminals appear at the head of spans in the parse trees of the PTB [Marcus et al., 1993] training set. We consider the Chomsky-transformed (binarized) parse trees for sentences in the PTB training set and collect the labeled spans {(A, i, j)} from the parse trees of all sentences. For each non-terminal A, we compute freq(A), the number of times the non-terminal A appears at the head of a span. Figure 1 shows the plot of freq(A) for in-terminals and pre-terminals, with the non-terminals sorted by the magnitude of freq(·). We observe that an extremely small subset of non-terminals has high frequency, which allows us to restrict the computation of the inside and outside probabilities to the few top non-terminals sorted by their freq scores. We select the most frequent non-terminals as candidates for forming the set Ñ.

Table 1: Unlabelled F1 scores on the PTB development set and 1-masking perplexity. Ĩ (P̃) denotes the set of in(pre)-terminals whose probabilities are computed, selected to be the in(pre)-terminals with top frequency. The PCFG is learned on the PTB training set. The ppl. column denotes the 1-masking perplexity on 200 sentences generated from the learned PCFG.
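The frequency-based selection described above can be sketched as follows; the span-list format mirrors the labeled-span representation defined earlier, and the function name is ours:

```python
from collections import Counter

def top_nonterminals(trees, k):
    """Given parse trees as lists of labeled spans (A, i, j), count how often
    each non-terminal heads a span and return the k most frequent ones."""
    freq = Counter(A for spans in trees for (A, _i, _j) in spans)
    return [A for A, _ in freq.most_common(k)]
```

Running this over the binarized PTB training trees (not shown here) would yield the candidate sets Ĩ and P̃ after splitting the result into in-terminals and pre-terminals.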
We verify the effect of restricting the computation to the frequent non-terminals on the 1-mask perplexity and the unlabeled F1 score of the approximate Inside-Outside algorithm in Table 1. Recall from Theorem 3.3 that the 1-mask probability distribution for a given sentence w_1, ..., w_L at any index i is given by Equation (12), and thus we can use Equation (12) to compute the 1-mask perplexity on the corpus. To measure the impact on 1-mask language modeling, we report the perplexity of the original and the approximate Inside-Outside algorithm on 200 sentences generated from the PCFG.
We observe that restricting the computation to the top-40 in-terminals and top-45 pre-terminals leads to a < 6.5% increase in average 1-mask perplexity. Furthermore, the Labeled-Recall algorithm suffers at most a 4.24% drop from the F1 performance of the original PCFG. If we further restrict the computation to the top-20 in-terminals and top-45 pre-terminals, we still get a 71.91% sentence F1 score, and the increase in average 1-mask perplexity is less than 8.6%. However, restricting the computation to 10 in-terminals leads to at least a 15% drop in parsing performance. Thus, combining Theorem 3.2 and Table 1, we have the following informal theorem.
Theorem 3.5 (Informal). Given the PCFG G = (N, I, P, n, p) learned on the English corpus, there exists an attention model with soft relative attention modules (11), with embeddings of size 275 + 40L, 2L + 1 layers, and 20 attention heads in each layer, that can simulate an approximate Inside-Outside algorithm on all sentences of length at most L generated from G, introducing a 9.29% increase in average 1-mask perplexity and resulting in at most an 8.71% drop in the parsing performance of the Labeled-Recall algorithm.
If we plug in the average length L ≈ 25 for sentences in PTB, we get a model with 20 attention heads, 1275 hidden dimensions, and 51 layers. Compared with the construction in Theorem 3.2, the size of the model is much closer to reality. Moreover, using this approximation does not hurt the parsing performance much: compared with parsing using the exact Inside-Outside algorithm, which achieves 75.90% corpus F1 and 78.77% sentence F1 on the PTB dataset, the approximate computation shows a drop of at most 8.71%. The parsing score is still highly non-trivial, since the naive baseline, Right Branching (RB), only gets < 40% sentence and corpus F1 scores on the PTB dataset.
Second ingredient: utilizing structure across non-terminals. In both Theorem 3.2 and Theorem 3.5, we still assign one attention head to the computation for each specific non-terminal, which does not utilize possible underlying structure between different non-terminals. Next, we give the intuition for a second approximation method that exploits such hidden structure.
Recall that in the proof of Theorem 3.1, we use one attention head at layer ℓ to compute the inside probabilities α(A, i, j) with j − i = ℓ (see eq. 7). Denote by α(i, j) ∈ R^{|Ĩ|} the vector of inside probabilities {α(A, i, j)}_{A ∈ Ĩ}. If the inside probabilities α(A, i, j) for the different non-terminals A ∈ Ĩ lie in a k^(ℓ)-dimensional subspace with k^(ℓ) < |Ĩ|, we can compute all of them using only k^(ℓ) attention heads instead of |Ĩ| by computing the vector W^(ℓ) α(i, j), where W^(ℓ) ∈ R^{k^(ℓ) × |Ĩ|} is the transformation matrix. Although the probabilities need not lie exactly in a low-dimensional subspace in reality, we can still learn a transformation matrix W^(ℓ) ∈ R^{k^(ℓ) × |Ĩ|} and approximately compute the inside probabilities as α̂(i, j) = (W^(ℓ))† W^(ℓ) α*(i, j) for j − i = ℓ, where α*(i, j) is computed using eq. 7. The same procedure can also be applied to the computation of the outside probabilities. Please refer to Appendix C.2 for more details on how we perform the approximate computations. We hypothesize that we can indeed find such transformation matrices {W^(ℓ)}_{ℓ ≤ L} that reduce the computation while retaining most of the performance.

Table 2: Parsing F1 results on the PTB development set and 1-masking perplexity on 200 sentences generated from the PCFG. For the baselines, we only compute the probabilities for the important non-terminals.
Hypothesis 3.6. For the PCFG G = (N, I, P, n, p) learned on the English corpus, there exist transformation matrices W^(ℓ) ∈ R^{k^(ℓ) × |Ĩ|} for every ℓ ≤ L, such that approximately simulating the Inside-Outside algorithm with {W^(ℓ)}_{ℓ ≤ L} introduces small error in the 1-mask perplexity and has minimal impact on the parsing performance of the Labeled-Recall algorithm.
We verify our hypothesis through experiments. We learn the matrix W^(ℓ) that captures the correlation of the non-terminals in Ĩ for spans of length ℓ + 1. For a single sentence s and a specific span (i, j) of length ℓ + 1, we compute the marginal probability µ(A, i, j) = α(A, i, j) × β(A, i, j) for each non-terminal A ∈ Ĩ, and denote by µ^{i,j}_s ∈ R^{|Ĩ|} the vector representation of the marginal probabilities for this span. We then compute X^(ℓ)_s = Σ_{i,j : j−i=ℓ} µ^{i,j}_s (µ^{i,j}_s)ᵀ as a matrix capturing the correlation of the in-terminals Ĩ for spans of length ℓ + 1 in sentence s. We then sum over the sentences in the PTB training set and obtain the normalized correlation matrix X^(ℓ). Finally, we apply the eigen-decomposition to X^(ℓ) and set W^(ℓ) to contain the eigenvectors with the top k^(ℓ) eigenvalues. Please refer to Appendix C.2 for more discussion on finding the transformation matrices {W^(ℓ)}_{ℓ ≤ L}.
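The fitting procedure above can be sketched with NumPy as follows. The array layout is our own assumption (one row per span of length ℓ + 1, holding the marginals over the kept in-terminals); it is a sketch of the eigen-decomposition step, not the paper's exact pipeline:

```python
import numpy as np

def learn_transform(mus, k):
    """Return W of shape (k, m): the top-k eigenvectors of the normalized
    correlation matrix X = (1/S) sum_s mu_s mu_s^T built from the rows of mus."""
    X = mus.T @ mus / len(mus)                 # (m, m) correlation matrix
    vals, vecs = np.linalg.eigh(X)             # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:k]
    return vecs[:, top].T

def approx_probs(W, alpha):
    # approximate reconstruction alpha_hat = W^+ (W alpha), as in the text
    return np.linalg.pinv(W) @ (W @ alpha)
```

When the probability vectors genuinely lie in a k-dimensional subspace, the compress-then-reconstruct map W† W recovers them exactly; otherwise it projects onto the learned subspace.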
Table 2 shows the parsing results and the 1-masking perplexity when utilizing the hidden structure {W^(ℓ)}_{ℓ≤L} with different k^(ℓ). Without utilizing the hidden structure, if we only compute the probabilities for the top-10 in-terminals, we only get 60.32% sentence F1 on PTB. Utilizing the hidden structure, we get 71.33% sentence F1 on PTB with a model that has only 15 attention heads, and 65.31% sentence F1 with a model that has only 10 attention heads. The following informal theorem summarizes the results.
Theorem 3.7 (Informal). Given the PCFG G = (N, I, P, n, p) learned on the English corpus, there exists an attention model with soft relative attention modules (11), with embeddings of size 275 + 40L, 2L + 1 layers, and 15 attention heads in each layer, that can simulate an approximate Inside-Outside algorithm on all sentences of length at most L generated from G, introducing an 8.6% increase in average 1-mask perplexity (eq. 12) and at most a 9.45% drop in the parsing performance of the Labeled-Recall algorithm.
Compared with the parsing results from Theorem 3.5, the corpus and sentence F1 scores are nearly the same, and we further reduce the number of attention heads in each layer from 20 to 15. If we only use 10 attention heads to approximately execute the Inside-Outside algorithm, we still get 61.72% corpus F1 and 65.31% sentence F1 on the PTB dataset, which is still much better than the Right-branching baseline. Theorem 3.7 shows that attention models with a size much closer to real models (like BERT or RoBERTa) still have enough capacity to parse decently well (>70% sentence F1 on PTB).
It is also worth noting that approximately executing the Inside-Outside algorithm using the transformation matrices {W^(ℓ)}_{ℓ≤L} is very different from reducing the size of the PCFG grammar, since we use a different matrix W^(ℓ) when computing the probabilities for spans of each length. If we instead learn the same transformation matrix W for all layers ℓ, the performance drops.

Probing Masked Language Models for Parsing Information
In Section 3, we showed that attention models are expressive enough to execute the Inside-Outside algorithm, and that the intermediate states of the constructed model contain syntactic information such as the probability of different labels for every span. However, these results are only existential, and it remains to be seen whether models trained with the masked language modeling loss contain similar information, such as information about the spans in the syntactic parse tree and the marginal probabilities computed by the Inside-Outside algorithm.
One difficulty in answering this question, as suggested by Maudslay and Cotterell [2021], is that syntactic probes on BERT-like models may leverage semantic cues to do parsing. To avoid this issue, we pre-train multiple RoBERTa models on synthetic datasets generated from the English PCFG (Section 4.1), which eliminates semantic relations among tokens. We then probe the trained models for building parse trees (Section 4.2). We consider three settings for probing: train and test the probe on synthetic PCFG data (PCFG); train and test on the PTB dataset (PTB); and train on the synthetic PCFG data while testing on PTB (out of distribution, OOD). Note that in the OOD setting, semantic relations appear neither in the pre-trained model nor in the probe. Hence the decisions of the probe must come entirely from syntactic relations. This serves as a baseline for syntactic probes on PTB. To verify whether the models indeed capture the information computed by the Inside-Outside algorithm, we further probe for marginal probabilities in the pre-trained models (Section 4.3).

Pre-training on PCFG
Experiment setup We generate 10^7 sentences for the training set from the PCFG, with an average length of 25 words. The training set is roughly 10% of the size of the training set of the original RoBERTa, which was trained on a combination of Wikipedia (2500M words) and BookCorpus (800M words). We also keep a small validation set of 5 × 10^4 sentences generated from the PCFG to track the MLM loss. We follow Izsak et al. [2021], Wettig et al. [2022] to pre-train all our models within a single day on a cluster of 8 RTX 2080 GPUs. Specifically, we train our models with AdamW [Loshchilov and Hutter, 2017], using 4096 sequences in a batch and hyperparameters (β1, β2, ε) = (0.9, 0.98, 10^−6). We follow a linear warmup schedule for 1380 training steps with a peak learning rate of 2 × 10^−3, after which the learning rate drops linearly to 0 (with the maximum possible training step being 2.3 × 10^4). We report the performance of all our models at step 5 × 10^3, where the loss seems to have converged for all the models.

Architecture To understand the impact of different components of the encoder model, we pre-train different models by varying the number of attention heads and layers. To understand the role of the number of layers, we start from the RoBERTa-base architecture, which has 12 layers and 12 attention heads, and vary the number of layers to 1, 3, and 6 to obtain 3 different architectures. Similarly, to understand the role of the number of attention heads, we start from the RoBERTa-base architecture and vary the number of attention heads to 3 and 24 to obtain 2 different architectures. For simplicity, we use AiLj to denote the model with i attention heads in each layer and j layers in this section (Section 4).
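As a concrete illustration, the warmup-then-linear-decay schedule described above can be written as a small function. This is a sketch using the hyperparameters stated in this section; the actual training code may implement the schedule differently.

```python
def lr_at_step(step, peak_lr=2e-3, warmup_steps=1380, max_steps=23000):
    """Linear warmup to peak_lr over warmup_steps, then linear decay
    to 0 at max_steps, mirroring the pre-training schedule above."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    frac = (step - warmup_steps) / (max_steps - warmup_steps)
    return max(0.0, peak_lr * (1.0 - frac))
```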

Experiment results
Table 3 shows the training and validation perplexity of different models. From the results, we first find that the trained models have a small gap between training and validation perplexity, implying that these models don't overfit the training set. We find that except for the models with too few layers (A12L1) or too few attention heads (A3L12), all the other models have nearly the same train and test perplexity. Further increasing the depth and the number of attention heads does not seem to improve the perplexity.

Probing for constituency parse trees
We probe the language models pre-trained on synthetic PCFG data and show that these models indeed capture "syntactic information": in particular, they capture the structure of the constituency parse trees underlying the input sentences.
Experiment setup We mostly follow the probing procedure of Vilares et al. [2020], Arps et al. [2022] for constituency parsing, which predicts the relative depth of common ancestors. Given a sentence w_1 w_2 … w_L with parse tree T (not necessarily binary), let depth(i, i+1) denote the depth of the least common ancestor of w_i and w_{i+1} in the parse tree T. We want to find a probe f(·) that predicts the relative depth tar(i) = depth(i, i+1) − depth(i−1, i) for position i. In Vilares et al. [2020], the probe f(·) is linear, and the input to the probe at position i is the concatenation of the embeddings at position i and at the BOS (or EOS) token. Besides the linear probe, we also experiment with a probe f(·) that is a 2-layer neural network with 16 hidden neurons.
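To make the probing target concrete, here is a sketch of how tar(i) can be computed from a parse tree. The nested-tuple tree representation (leaves are word positions) is our own toy encoding, not the paper's code.

```python
def leaf_paths(tree, path=(), out=None):
    """Map each leaf (word position) to its root-to-leaf path of child
    indices; trees are nested tuples, leaves are ints."""
    if out is None:
        out = {}
    if isinstance(tree, int):
        out[tree] = path
        return out
    for c, child in enumerate(tree):
        leaf_paths(child, path + (c,), out)
    return out

def lca_depth(paths, i, j):
    """Depth of the least common ancestor of leaves i and j (root = 0)."""
    pi, pj = paths[i], paths[j]
    d = 0
    while d < min(len(pi), len(pj)) and pi[d] == pj[d]:
        d += 1
    return d

def relative_depth_targets(tree, n_leaves):
    """tar(i) = depth(i, i+1) - depth(i-1, i), the probe's regression
    target, for 1 <= i <= n_leaves - 2."""
    paths = leaf_paths(tree)
    depth = [lca_depth(paths, i, i + 1) for i in range(n_leaves - 1)]
    return [depth[i] - depth[i - 1] for i in range(1, n_leaves - 1)]
```

For the balanced tree ((0, 1), (2, 3)), adjacent pairs (0,1) and (2,3) meet one level below the root while (1,2) meet at the root, so the targets are [-1, 1].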
As discussed before, we consider three settings: PCFG, PTB, and OOD. In the PCFG setting, we train and test the probe f(·) on the synthetic PCFG data we generated; in the PTB setting, we train on the PTB training set and test on the PTB development set [Marcus et al., 1993] without removing the punctuation. In the OOD setting, we train the probe on the synthetic PCFG dataset but test on the PTB development set, excluding nearly all the semantic information from the probe itself.

Table 4: The parsing results (unlabelled F1 score) for different models under different settings. Linear and 2-layer NN denote the classifiers for the probes respectively. Each entry denotes the best F1 score achieved using one of the layers' (contextualized) embeddings. The IO column denotes the results for parsing using the Inside-Outside algorithm. AiLj denotes the model with i attention heads and j layers. We highlight the scores that are within 1% of the max (except the IO) in each row.

Experiment results
From Figure 2, we first observe that in all settings there is a huge gap between the probe trained on layer 0's embeddings and the one trained on the best layer's embeddings. We view the performance on layer 0's embeddings as a baseline, and this shows that both probing methods benefit significantly from the representations of later layers. Table 4 shows detailed probing results for the different settings (PCFG, PTB, and OOD) and different probes (linear or a 2-layer neural net) on different models. Except for the A12L1 and A3L12 models, the linear and neural net probes give decent parsing scores (>70% sentence F1 for neural net probes) in both the PCFG and PTB settings. As for the OOD setting, the performance achieved by the best layer drops by about 5% compared with PCFG and PTB, but it is still much better than the performance achieved by the layer-0 embeddings. In this setting, there is no semantic information even in the probe itself, so it gives a baseline for probes on the PTB dataset that use only syntactic data. As a comparison, the naive Right-branching (RB) baseline reaches <40% for both sentence and corpus F1 score [Li et al., 2020] on the PTB dataset, and if we probe layer 0's embeddings, the sentence F1 is <41% in all settings for all models. Our positive results on syntactic parsing support the claim that pre-training language models with the masked language modeling loss can indeed capture the structural information of the underlying constituency parse tree.
Compared with probing results on BERT-like models pre-trained on natural language, whose labeled sentence F1 is 78.2% for BERT [Vilares et al., 2020] and 80.4% for RoBERTa [Arps et al., 2022] on the PTB dataset, the A12L12 model achieves 69.31% unlabelled F1 with the linear probe and 71.32% with the NN probe in the PTB setting. The gap with the previous literature suggests that BERT-like models trained on natural language indeed contain semantic cues in their embeddings that help parsing.

Probing for the marginal probabilities
Section 4.2 verifies that language models can capture structural information of the constituency parse trees, but we still don't understand how this information is represented. Sections 3.1 and 3.2 suggest that a possible mechanism for transformers to capture the syntactic information is to execute the Inside-Outside algorithm. In this subsection, we test whether the intermediate-layer representations of the transformer can be used to predict the marginal probabilities computed by the Inside-Outside algorithm.

Experiment setup
To test whether the model contains information computed by the Inside-Outside algorithm, we train a probe to predict the normalized marginal probabilities for spans of a specific length. Fix the span length ℓ; for each sentence w_1 w_2 … w_L, denote by e_1, e_2, …, e_L the embeddings from the last layer of the pre-trained language model. We want to find a probe f^(ℓ) that, for each span [i, i + ℓ − 1] of length ℓ, predicts s(i, i + ℓ − 1), where s(i, j) = max_A µ(A, i, j) is the marginal probability of span [i, j] and µ(A, i, j) is given by eq. 4. Here, the input to the probe is [e_i; e_{i+ℓ−1}] ∈ R^{2d}, the concatenation of e_i and e_{i+ℓ−1}. To test the sensitivity of our probe, we also take the embeddings from the 0-th layer as input to the probe f^(ℓ). We give two choices for the probe f^(ℓ): (1) linear, and (2) a 2-layer neural network with 16 hidden neurons, since the relation between the embeddings and the target may not be a simple linear function. Similar to Section 4.2, we also consider three settings: PCFG, PTB, and OOD.
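Given inside and outside tables, the probe target can be sketched as follows. We assume µ(A, i, j) is α(A, i, j)β(A, i, j) normalized by the sentence probability, which is the standard form of the span marginal; the exact eq. 4 may differ, and the array layout is our own.

```python
import numpy as np

def span_scores(alpha, beta, root=0):
    """Probe targets s(i, j) = max_A mu(A, i, j).

    alpha, beta: arrays of shape (num_nonterminals, L, L) with entry
    [A, i, j] for span [i, j] (0-indexed, inclusive); `root` indexes the
    start symbol, so alpha[root, 0, L-1] is the sentence probability."""
    L = alpha.shape[1]
    Z = alpha[root, 0, L - 1]  # normalizer: inside probability of the full span
    mu = alpha * beta / Z      # normalized marginal for every (A, i, j)
    return mu.max(axis=0)      # maximize over nonterminals A
```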
Experiment results Figure 3a compares the 4 different probes for the model with 12 attention heads and 12 layers: the input comes from layer 0 (which is un-contextualized) or layer 12 (which is contextualized), and the probe is linear or a 2-layer neural network. We observe that for both the linear and the neural net probe, changing the input from layer 0 to layer 12 drastically increases the correlation of the predictions, which again suggests that the token embeddings alone (the word embeddings and the positional embeddings) do not contain enough information about the marginal probabilities. Besides, the neural net predicts better on layer-12 embeddings but performs nearly the same on layer 0, suggesting that the neural network is a better probe in this setting.
Figure 3b compares the probing results under the three different settings. We observe that the probe achieves a high correlation with the real marginal probabilities under all settings. Moreover, it is surprising that there is nearly no performance drop when we change the test dataset from PCFG to PTB (PCFG setting vs. OOD setting), implying that the probe (together with the embeddings) really captures the syntactic information computed by the Inside-Outside algorithm instead of overfitting to the training dataset.
Table 5 shows the probing results on different pre-trained models. We observe that the neural network probe is highly correlated with the target for all pre-trained models except those with 12 heads and 1 layer and with 3 heads and 12 layers. When the length of the spans increases, the probing correlation becomes worse, which means that syntactic information over longer ranges is harder for the pre-trained language model to capture. Surprisingly, however, even for length-10 spans, the NN probe can reach 78% F1 for the best model. The high correlation between probes and the target gives a strong hint that the pre-trained models contain certain syntactic information computed by the Inside-Outside algorithm. All the results in this experiment show that training on MLM may incentivize the model to do some approximation of the Inside-Outside algorithm, verifying our constructions in Section 3.

Table 5: Probing for the "normalized" marginal probabilities at different lengths with a linear model or a 2-layer neural net. We show the Pearson correlation between the predicted probabilities and the probabilities computed by the Inside-Outside algorithm on the PTB dataset. The two numbers in each entry denote the correlations obtained by a linear probe and a 2-layer neural net probe respectively, whose inputs come from the final layer of the model. AiLj denotes the model with i attention heads and j layers.

Related Works
Probing for transformers There has been emerging interest in understanding the information that BERT-like models encode implicitly in their embeddings [Rogers et al., 2020], e.g. the syntactic (structural) information [Hewitt and Manning, 2019, Reif et al., 2019, Manning et al., 2020, Vilares et al., 2020, Maudslay et al., 2020, Maudslay and Cotterell, 2021, Chen et al., 2021, Arps et al., 2022]. However, as mentioned in Maudslay and Cotterell [2021], the probing success of the previous works on syntax may have emerged from the model using semantic cues to parse: the probed models had been pre-trained on natural language datasets, where the semantic structures are most likely correlated with the syntactic ones. Indeed, Arps et al. [2022] tried to separate semantics from syntax by training the probe on a mixture of natural language and manipulated data; however, the authors acknowledged the possibility of semantics still affecting the decisions of the probe trained on manipulated data. Besides syntax, researchers have also performed probing experiments for other linguistic structures like semantics, sentiment, etc. [Belinkov et al., 2017, Reif et al., 2019, Kim et al., 2020, Richardson et al., 2020].

Yun et al. [2020a,b] showed that transformers are universal sequence-to-sequence function approximators, while Pérez et al. [2021], Bhattamishra et al. [2020] showed that attention models can simulate Turing machines. Attention models of bounded size have been shown capable of recognizing deterministic context-free languages such as bounded-depth Dyck-k [Yao et al., 2021]. The size of the constructed models, however, depends on the complexity of the target function and often requires arbitrary precision to encode the target function. Wei et al. [2022] proposed statistically meaningful approximations of Turing machines using attention models that also exhibit good statistical learnability. Liu et al. [2022] constructed (by hand) transformers that can efficiently simulate automata for inputs from a small range of lengths. Interestingly, we observe that language models pre-trained on PCFG-generated data encode relevant information from the Inside-Outside algorithm for sentences from both natural language and PCFG-generated data, which suggests an implicit bias toward learning the general algorithm (generalizing to all input lengths). A careful study is left for future work.

Conclusion and Further Discussion
In this work, we show that masked language models of moderate size have the capacity to parse decently well. Besides, we probe BERT-like models pre-trained (with the MLM loss) on synthetic text generated from PCFGs and empirically verify that these models capture syntactic information that helps reconstruct (partially) a parse tree. Furthermore, we show that the models contain the marginal span probabilities computed by the Inside-Outside algorithm, thus connecting masked language pre-training and parsing. We hope our findings may yield new insights into large language models and masked language modeling. One limitation of our paper is that we use probing experiments to show the existence of Inside-Outside probabilities inside the contextualized embeddings. However, we don't have definitive experiments to show whether the learned model actually simulates the Inside-Outside algorithm on the input. We leave for future work the design of experiments to interpret the content of the contextualized embeddings and thus "reverse-engineer" the algorithms used by the learned model. Other interesting directions include the convergence analysis of attention models on different generative models, the importance of model scale during pre-training under different generative models, and the differences between parsing with shallow and deep models.

A.1 Probing on embeddings from different layers

In Section 4.2, we show the probing results on the embeddings either from the 0-th layer or from the best layer (the layer that achieves the highest F1 score) of different pre-trained models. In this section, we show how the F1 score changes across layers.
Figure 4 shows sentence F1 scores for linear probes f(·) trained on different layers' embeddings for different pre-trained models, under the PCFG and PTB settings. From Figure 4, we observe that using the embeddings from the 0-th layer only gets sentence F1 scores close to (or even worse than) the naive Right-branching baseline for all the pre-trained models. However, except for the A3L12 model, the linear probe gets at least 60% sentence F1 using the embeddings from layer 1. The sentence F1 score then increases with the layer and nearly saturates at layer 3 or 4. The F1 score at later layers may be better than the F1 score at layer 3 or 4, but the improvement is not significant. These observations still hold if we change the linear probe to a neural network, consider the OOD setting instead of PCFG and PTB, or change the measurement from sentence F1 to corpus F1.
Our observations suggest that most of the constituency parse tree information is encoded in the lower layers, and a lot of it can be captured even in the first layer. Although our constructions (Theorems 3.1 and 3.2) and approximations (Theorems 3.5 and 3.7) bring the number of attention heads and embedding dimensions close to those of real language models, we don't know how to bring the number of layers close to that of BERT or RoBERTa (although our number is acceptable, since GPT-3 has 96 layers). More understanding is needed of how language models can process such information in so few layers.
Comparison with probes using other input structures In Section 4.2, we train a probe f(·) to predict the relative depth tar(i) = depth(i, i+1) − depth(i−1, i), where the input to the probe is the concatenation of the embedding e^(ℓ)_i at position i and the embedding e^(ℓ)_EOS of the EOS token at some layer ℓ. Besides taking [e^(ℓ)_i; e^(ℓ)_EOS] as the input structure, it is also natural to use the concatenation of adjacent embeddings [e^(ℓ)_{i−1}; e^(ℓ)_i; e^(ℓ)_{i+1}] to predict tar(i). Figure 5 shows the probing results on A12L12, the model with 12 attention heads and 12 layers. We compare probes with different input structures (EOS or ADJ), where the input embeddings come from different layers (the 0-th layer or the layer that achieves the best F1 score). We observe that: (1) the probes using the ADJ input structure have better parsing scores than the probes using the EOS input structure, and (2) the sentence F1 for the probes using the ADJ input structure is high even if the input comes from layer 0 of the model (>55% for linear f(·) and >60% for neural network f(·)). Although the probe using ADJ has better parsing scores than the probe using EOS, it is harder to test whether it is a good probe, since the concatenation of adjacent embeddings [e_{i−1}; e_i; e_{i+1}] from layer 0 is already contextualized, and it is hard to find a good baseline to show that the probe is sensitive to the information we want to test. Thus, we choose to follow Vilares et al. [2020], Arps et al. [2022] and use the probe with input structure [e^(ℓ)_i; e^(ℓ)_EOS] in Section 4.2. Nonetheless, the results for probes taking [e_{i−1}; e_i; e_{i+1}] from layer 0 as input are already surprising: knowing only three adjacent word identities and their positions (the token embedding consists of the word embedding and the positional embedding) and training a 2-layer neural network on top, we get 62.67%, 63.91%, and 57.02% sentence F1 scores under the PCFG, PTB, and OOD settings respectively. As a comparison, the probe taking [e_i; e_EOS] from layer 0 as input [Vilares et al., 2020, Arps et al., 2022] only gets 39.06%, 39.31%, and 33.33% sentence F1 under the PCFG, PTB, and OOD settings respectively. This shows that a lot of syntactic information (useful for parsing) can be captured just from adjacent words without more context.
More discussion on probing measurement The (unlabelled) F1 score is the default performance measurement in the constituency parsing and syntactic probing literature. However, we would like to point out that focusing only on the F1 score may introduce some bias. Because all spans have equal weight when computing the F1 score, and most spans in a tree are short (if the parse tree is perfectly balanced, length-2 spans make up half of the spans in the parse tree), one can get a decent F1 score by being correct only on short spans. Besides, we also show that by taking the inputs [e_{i−1}; e_i; e_{i+1}] from layer 0 of the model with 12 attention heads and 12 layers, we can already capture a lot of the syntactic information needed to recover the constituency parse tree (and get a decent F1 score). Thus, the F1 score for the whole parse tree may cause people to focus less on long-range dependencies and structures, and more on short-range ones.
To mitigate this problem, Vilares et al. [2020] computed the F1 score not only for the whole parse tree but also for each span length. They showed that BERT trained on natural language can get a very good F1 score when the spans are short (for length-2 spans, the probing F1 is over 80%), but the F1 score quickly drops as the spans become longer: even for spans of length 5, the F1 score is below 70%, and for spans of length 10, it is below 60%. Our experiments that probe the marginal probabilities for spans of different lengths (Section 4.3) can also be viewed as an approach to mitigating the problem.
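Per-length span F1 can be sketched as follows; this is our own illustrative implementation, not the evaluation code of the cited works.

```python
from collections import defaultdict

def span_f1_by_length(gold_spans, pred_spans):
    """Unlabelled span F1 computed separately for each span length,
    mitigating the bias toward short spans discussed above.

    gold_spans, pred_spans: sets of (i, j) index pairs with i <= j.
    Returns {length: f1}."""
    by_len = defaultdict(lambda: [0, 0, 0])  # [matched, gold, predicted]
    for i, j in gold_spans:
        by_len[j - i + 1][1] += 1
    for i, j in pred_spans:
        by_len[j - i + 1][2] += 1
        if (i, j) in gold_spans:
            by_len[j - i + 1][0] += 1
    f1 = {}
    for length, (m, g, p) in by_len.items():
        prec = m / p if p else 0.0
        rec = m / g if g else 0.0
        f1[length] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return f1
```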

A.2 Analysis of attention patterns
In Section 4.2, we probed the embeddings of the models pre-trained on synthetic data generated from the PCFG and showed that training on MLM indeed captures syntactic information that can recover the constituency parse tree, but we don't know how the models capture that information. Theorem 3.3 builds a connection between MLM and the Inside-Outside algorithm, and the connection is also verified in Section 4.3, which shows that the embeddings also contain the marginal probability information computed by the Inside-Outside algorithm. However, we have only established a correlation between the Inside-Outside algorithm and attention models, and we still don't know the mechanism inside the language models: the model may be executing the Inside-Outside algorithm (or some approximation of it), but it may also use some mechanism far from the Inside-Outside algorithm that happens to contain the marginal probability information. To understand more about the mechanism of language models, we need to open up the black box and go further than probing; this section serves as one step in that direction.

General idea
The key ingredient that distinguishes current large language models from fully-connected neural networks is the self-attention module. Thus, besides probing for certain information, we can also look at the attention score matrices and discover patterns. In particular, we are interested in how far away an attention head looks, which we call the "averaged attended distance".
Averaged attended distance For a model and a particular attention head, given a sentence s of length L_s, the head generates an L_s × L_s matrix A containing the pair-wise attention scores, where each row of A sums to 1. We then compute the "averaged attended distance" (1/L_s) Σ_i Σ_j A_{i,j} |i − j|, which can be intuitively interpreted as the average distance this attention head looks at. We then average this quantity over all sentences. We compute the averaged attended distance for three models on the synthetic PCFG dataset and the PTB dataset. The models all have 12 attention heads in each layer but have 12, 6, and 3 layers respectively.
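Under this reading of the quantity (the attention-weighted mean of |i − j|, averaged over query positions, which is our reconstruction of the formula), it can be computed as:

```python
import numpy as np

def averaged_attended_distance(A):
    """Average distance an attention head looks at for one sentence:
    weight |i - j| by the attention score A[i, j] (rows sum to 1),
    then average over query positions i."""
    L = A.shape[0]
    idx = np.arange(L)
    dist = np.abs(idx[:, None] - idx[None, :])  # |i - j| for every pair
    return float((A * dist).sum(axis=1).mean())
```

A head attending only to its own position scores 0, while a uniform head on a length-3 sentence scores 8/9.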

Experiment results
Figure 6 shows the "averaged attended distance" of each attention head in different models. Figures 6a, 6c and 6e show the results on the synthetic PCFG dataset, and Figures 6b, 6d and 6f show the results on the PTB dataset. We sort the attention heads in each layer according to the averaged attended distance. From Figures 6a, 6c and 6e, we find that for all models, several attention heads in the first layer look at very close tokens (averaged attended distance less than 3). As the layer increases, the averaged attended distance also increases in general, meaning that the attention heads look at farther tokens. Then, at some layer, some attention heads look at very far tokens (averaged attended distance larger than 12). This finding hints that the model is doing something that correlates with our construction: it looks at longer spans as the layer increases. However, unlike our construction, where each attention head only looks at spans of a fixed length, models trained using MLM look at spans of different lengths at each layer, which cannot be explained by our current construction and calls for a further understanding of the mechanism of large language models.
Besides, we find that the patterns are nearly the same for the synthetic PCFG dataset and the PTB dataset, so the previous findings transfer to the PTB dataset as well.

Notations: For typographical simplicity, we divide our embeddings into 2 sub-parts. We use the first |N|L coordinates to store the inside probabilities and the second |N|L coordinates to store the outside probabilities. For every position i and span length ℓ + 1, we store the inside probabilities {α(A, i − ℓ, i)}_{A∈N} after computation in its embedding at coordinates [|N|ℓ, |N|(ℓ + 1)), where the coordinates of the embeddings start from 0. Similarly, we store {β(A, i, i + ℓ)}_{A∈N} at [|N|(L + ℓ), |N|(L + ℓ + 1)). For simplicity of presentation, we won't handle cases where i + ℓ or i − ℓ is outside the range 1 to L; those coordinates are fixed to 0.
Token embeddings: The initial embedding for each token w contains Pr[A → w] for all A ∈ P. This initializes the inside probabilities of all spans of length 1.
Relative position embeddings: We introduce 2L + 1 relative position vectors {p_t ∈ R^{2|N|L}}_{−L≤t≤L}, which modify the key vectors depending on the relative position of the query and key tokens. Furthermore, we introduce (2L − 1)L relative position-dependent biases {b_{t,ℓ} ∈ R}_{−L≤t≤L, 1≤ℓ≤2L−1}. We introduce the structures of the biases in the contexts of their intended uses.
Structure of {p_t}_{−L≤t≤L}: For t < 0, we define p_t such that all coordinates in [|N|(−t − 1), |N|(−t)) are set to 1, with the rest set to 0. For t > 0, we define p_t such that all coordinates in [|N|(L + t − 1), |N|(L + t)) are set to 1, with the rest set to 0. p_0 is set to all 0s.
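The block structure of the relative position vectors can be written down directly. This is a sketch of the coordinate layout described above, with N standing for |N|.

```python
import numpy as np

def relative_position_vector(t, N, L):
    """Construct p_t in R^{2*N*L}: for t < 0, ones on coordinates
    [N*(-t-1), N*(-t)); for t > 0, ones on [N*(L+t-1), N*(L+t));
    p_0 is all zeros."""
    p = np.zeros(2 * N * L)
    if t < 0:
        p[N * (-t - 1): N * (-t)] = 1.0
    elif t > 0:
        p[N * (L + t - 1): N * (L + t)] = 1.0
    return p
```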
Attention formulation: At any layer 1 ≤ ℓ ≤ 2L − 1 except L, we define the attention score a^h_{i,j} between the query at position i and the key at position j using the relative position embeddings; at layer L, we do not use the relative position embeddings.

Inside probabilities: The contextual embedding at position i after the computations of any layer ℓ < L contains the inside probabilities of all spans of length at most ℓ + 1 ending at position i, i.e. α(A, i − k, i) for all A ∈ N and k ≤ ℓ. The rest of the coordinates contain 0.
Structure of {b_{t,ℓ}}_{−L≤t≤L, 1≤ℓ≤L−1}: For any 1 ≤ ℓ ≤ L − 1, for all t ≥ 0 and t < −ℓ, we set b_{t,ℓ} to ζ for some large constant ζ. All other biases are set to 1.
Layer 1 ≤ ℓ < L: At each position i, this layer computes the inside probabilities of spans of length ℓ + 1 ending at i, using the recursive formulation from eq. 2.
For every non-terminal A ∈ N, we use a unique attention head to compute α(A, i − ℓ, i) at each token i. Specifically, the attention head representing non-terminal A ∈ N represents the following operation at each position i:

α(A, i − ℓ, i) = Σ_{j=i−ℓ}^{i−1} Σ_{B,C∈N} Pr[A → BC] · α(B, i − ℓ, j) · α(C, j + 1, i).    (15)

In the final step, we swapped the order of the summations to observe that the desired computation can be represented as a sum over individual computations at locations j < i. That is, we represent Σ_{B,C∈N} Pr[A → BC] · α(B, i − ℓ, j) · α(C, j + 1, i) as the attention score a_{i,j} for all i − ℓ ≤ j ≤ i − 1, while α(A, i − ℓ, i) is represented as Σ_{i−ℓ≤j≤i−1} a_{i,j}.
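The recursion that each head simulates is the standard inside pass of the Inside-Outside algorithm. For reference, here is a direct (non-attention) sketch, with our own toy grammar representation.

```python
from collections import defaultdict

def inside(words, lexical, binary, nonterminals):
    """Inside probabilities alpha(A, i, j) via the recursion above:
    alpha(A, i, j) = sum over split points j' and rules A -> B C of
    Pr[A -> B C] * alpha(B, i, j') * alpha(C, j'+1, j).

    lexical: dict (A, w) -> Pr[A -> w]; binary: dict (A, B, C) -> Pr[A -> B C].
    Returns dict (A, i, j) -> alpha, with 0-indexed inclusive spans."""
    L = len(words)
    alpha = defaultdict(float)
    for i, w in enumerate(words):             # spans of length 1
        for A in nonterminals:
            alpha[(A, i, i)] = lexical.get((A, w), 0.0)
    for span in range(2, L + 1):              # longer spans, bottom-up
        for i in range(L - span + 1):
            j = i + span - 1
            for (A, B, C), p in binary.items():
                for k in range(i, j):         # split point
                    alpha[(A, i, j)] += p * alpha[(B, i, k)] * alpha[(C, k + 1, j)]
    return alpha
```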
If A ∈ P, we replace the character A with w with probability Pr[A → w]. If A ∈ I, we replace the character A with two characters BC with probability Pr[A → BC].
with the rest set to −ζ for some large constant ζ. The rest of the blocks are set to 0. We give an intuition behind the structure of Q: for any position i and range 1 ≤ ℓ, e^(ℓ−1)_i contains the inside probabilities {α(A, i − k, i)}. First, we set the key matrix K^(ℓ̃)_{A,1} to I. If we define P_{A,r} ∈ R^{|N|×|N|} as the matrix containing {Pr[B → CA]}_{B,C∈N}, which is the set of all rules where A appears as the right child, then Q contains P_{A,r} in the corresponding block, with the rest set to −ζ for some large constant ζ, and the rest of the blocks set to 0.

Figure 2 :
Figure 2: Comparison between different probes under different settings. The probe is set to be linear or a 2-layer neural net, and the input to the probe is layer 0's embedding from A12L12 or the embeddings from the layer that achieves the highest F1 score.
(a) We compare 4 probes under the PTB setting: the input comes from the 0-th layer or the 12-th layer, and the probe is linear or NN. (b) We compare the NN probe using the 12-th layer embeddings from the A12L12 model under different settings.

Figure 3 :
Figure 3: Comparison between different probes for marginal probabilities under different settings on the pre-trained model with 12 attention heads and 12 layers. The y-axis denotes the correlation between the probe output and the target, and the x-axis denotes probes for different span lengths.
(Figure 4 panels: comparisons under the PCFG and PTB settings, varying the number of layers and the number of attention heads.)

Figure 4 :
Figure 4: Sentence F1 for linear probes f(·) trained on different layers' embeddings for different pre-trained models. We show the results under the PCFG and PTB settings. AiLj denotes the pre-trained model with i attention heads and j layers.

Figure 5 :
Figure 5: Comparison of the probes with different inputs under different settings. We probe the model with 12 attention heads and 12 layers, and report the scores with f(·) taking embeddings from layer 0 or from the best layer. EOS denotes the probe that takes [e^(ℓ)_i; e^(ℓ)_EOS] as input and predicts the relative depth tar(i), and ADJ (adjacent embeddings) denotes the probe that takes [e^(ℓ)_{i−1}; e^(ℓ)_i; e^(ℓ)_{i+1}] as input.

B Missing Proofs in Section 3

B.1 Proof of Theorem 3.2

Similar to the proof of Theorem 3.1, the first L − 1 layers simulate the recursive formulation of the inside probabilities from eq. 2, and the last L − 1 layers simulate the recursive formulation of the outside probabilities from eq. 3. The model uses embeddings of size 2|N|L and 4L + 2 relative position embeddings.

(Figure 6 panel titles: 12 attention heads and 12 layers, PCFG dataset; 12 attention heads and 12 layers, PTB dataset; 12 attention heads and 3 layers, PTB dataset.)

Figure 6 :
Figure 6: "Averaged attended distance" of each attention head for different models on the PCFG and PTB datasets. Figures 6a, 6c and 6e show the results on the synthetic PCFG dataset, and Figures 6b, 6d and 6f show the results on the PTB dataset.

Table 1 :
Experiment results from approximately computing the Inside-Outside algorithm with very few non-terminals.

Table 2 :
Experiment results using the learned transformations {W^(ℓ)} to approximately compute the Inside-Outside algorithm.

Table 3 :
The perplexity of different RoBERTa models pre-trained on synthetic PCFG data.AiLj denotes the model with i attention heads and j layers.