Self-Attention Networks Can Process Bounded Hierarchical Languages

Despite their impressive performance in NLP, self-attention networks were recently proved to be limited for processing formal languages with hierarchical structure, such as Dyck-k, the language consisting of well-nested parentheses of k types. This suggested that natural language can be approximated well with models that are too weak for formal languages, or that the role of hierarchy and recursion in natural language might be limited. We qualify this implication by proving that self-attention networks can process Dyck-(k, D), the subset of Dyck-k with depth bounded by D, which arguably better captures the bounded hierarchical structure of natural language. Specifically, we construct a hard-attention network with D+1 layers and O(log k) memory size (per token per layer) that recognizes Dyck-(k, D), and a soft-attention network with two layers and O(log k) memory size that generates Dyck-(k, D). Experiments show that self-attention networks trained on Dyck-(k, D) generalize to longer inputs with near-perfect accuracy, and also verify the theoretical memory advantage of self-attention networks over recurrent networks.


Introduction
Transformers (Vaswani et al., 2017) are now the undisputed champions across several benchmark leaderboards in NLP. The major innovation of this architecture, self-attention, processes input tokens in a distributed way, enabling efficient parallel computation as well as long-range dependency modeling. The empirical success of self-attention in NLP has led to a growing interest in studying its properties, with an eye towards a better understanding of the nature and characteristics of natural language (Tran et al., 2018; Papadimitriou and Jurafsky, 2020).
In particular, it was recently shown that self-attention networks cannot process various kinds of formal languages (Hahn, 2020; Bhattamishra et al., 2020a), among which particularly notable is Dyck-k, the language of well-balanced brackets of k types. By the Chomsky-Schützenberger Theorem (Chomsky and Schützenberger, 1959), any context-free language can be obtained from a Dyck-k language through intersections with regular languages and homomorphisms. In other words, this simple language contains the essence of all context-free languages, i.e. hierarchical structure, center embedding, and recursion, features which have long been claimed to be at the foundation of human language syntax (Chomsky, 1956).
Consider for example the long-range and nested dependencies in English subject-verb agreement: (Laws (the lawmaker) [. Given the state-of-the-art performance of Transformers in parsing natural language (He and Choi, 2019), the Dyck-k blind spot seems very suggestive. If the world's best NLP models cannot deal with this simple language, generated by a grammar with k + 2 rules and recognized by a single-state pushdown automaton, does this not mean that the role of hierarchy and recursion in natural language must be limited? This question has, of course, been extensively debated by linguists on the basis of both theoretical and psycholinguistic evidence (Hauser et al., 2002; Frank et al., 2012; Nelson et al., 2017; Brennan and Hale, 2019; Frank and Christiansen, 2018). So, what can self-attention networks tell us about natural language and recursion? Here we provide a new twist to this question by considering Dyck-(k, D), the subset of Dyck-k with nesting depth at most D, and show that Transformers can process it. Dyck-(k, D) models bounded (or finite) recursion, and thus captures the hierarchical structure of human language much more realistically. For example, the center-embedding depth of natural language sentences is known to rarely exceed three (Karlsson, 2007; Jin et al., 2018), and while pragmatics, discourse, and narrative can result in deeper recursion in language (Levinson, 2014), there is arguably a relatively small limit to that depth as well.

Figure 1: In construction (a), at each layer, the innermost brackets attend to their matching brackets and "cancel" each other, yielding "shallower" spans for successive layers to process. In construction (b), the first layer computes the depth of each token by attending to all previous tokens, while the second layer uses depth information to find the most recent unclosed open bracket in the history.
In particular, we prove that self-attention networks can both recognize and generate Dyck-(k, D), with two conceptually simple yet different constructions (Figure 1). The first network requires D + 1 layers and a memory size of O(log k) (per layer per token) to recognize Dyck-(k, D), using a distributed mechanism of parenthesis matching. The second network has two layers and memory size O(log k). It works by attending to all previous tokens to count the depth of each token in the first layer, and then using this depth information to attend to the most recent unclosed open bracket in the second layer. Our constructions help reconcile the result in Hahn (2020) with the success of Transformers in handling natural languages.
Our proof requires certain assumptions about the positional encodings, an issue that is often considered in empirical papers (Ke et al., 2021; Shaw et al., 2018; Wang et al., 2020; Shiv and Quirk, 2019) but not in the more theoretical literature. First, positional encodings must have at least log n bits when the input length is n, as otherwise different positions would share the same representation. More importantly, positional encodings should support easy position comparisons, since token order is vital in formal language processing. Our experiments show that two standard practices, namely learnable or fixed sine/cosine positional encodings, cannot generalize well on Dyck-(k, D) beyond the training input lengths. In contrast, using a single fixed scalar monotonic positional encoding such as pos/n achieves near-perfect accuracy even on inputs significantly longer than the training ones. Our findings provide a novel perspective on the function of positional encodings, and imply that different applications of self-attention networks (in this case, natural vs. formal language) may require different model choices.
Our theoretical results also bring about interesting comparisons to recurrent networks (e.g. RNNs, LSTMs) in terms of the resources needed to process hierarchical structure. While recurrent networks with finite precision need at least Ω(D log k) memory to process Dyck-(k, D) (Hewitt et al., 2020), our second construction requires only O(log k) memory but O(log n) precision. In experiments where precision is not an issue for practical input lengths (< 10^4), we confirm that a Transformer requires less memory than an LSTM to reach high test accuracies. This may help explain why Transformers outperform RNNs/LSTMs on syntactic tasks in NLP, and shed light on fundamental differences between recurrent and non-recurrent sequence processing.

Related work
Our work primarily relates to the ongoing effort of characterizing theoretical abilities (Pérez et al., 2019; Bhattamishra et al., 2020b; Yun et al., 2020) and limitations of self-attention networks, particularly through formal hierarchical structures like Dyck-k. Hahn (2020) proves that (even with positional encodings) hard-attention Transformers cannot model Dyck-k, and soft-attention Transformers with bounded Lipschitz continuity cannot model Dyck-k with perfect cross entropy. Bhattamishra et al. (2020a) prove a soft-attention network with positional masking (but no positional encodings) can solve Dyck-1 but not Dyck-2. Despite the expressivity issues theoretically posed by the above work, empirical findings have shown Transformers can learn Dyck-k from finite samples and outperform LSTMs (Ebrahimi et al., 2020). Our work addresses the theory-practice discrepancy by using positional encodings and modeling Dyck-(k, D).
A parallel line of work with a much longer tradition (Elman, 1990; Das et al., 1992; Steijvers and Grünwald, 1996) investigates the abilities and limitations of recurrent networks in processing hierarchical structures. In particular, RNNs and LSTMs are proved capable of solving context-free languages like Dyck-k given infinite precision (Korsky and Berwick, 2019) or external memory (Suzgun et al., 2019; Merrill et al., 2020). However, Merrill et al. (2020) also prove that RNNs/LSTMs cannot process Dyck-k without such assumptions, which aligns with experimental findings that recurrent networks perform or generalize poorly on Dyck-k (Bernardy, 2018; Sennhauser and Berwick, 2018; Yu et al., 2019). Hewitt et al. (2020) propose to consider Dyck-(k, D) as it better captures natural language, and show finite-precision RNNs can solve Dyck-(k, D) with Θ(D log k) memory.
For the broader NLP community, our results also contribute to settling whether self-attention networks are limited in modeling hierarchical structures due to their non-recurrence, a concern (Tran et al., 2018) often turned into proposals to equip Transformers with recurrence (Dehghani et al., 2019; Shen et al., 2018; Chen et al., 2018; Hao et al., 2019). On one hand, Transformers are shown to encode syntactic (Lin et al., 2019; Tenney et al., 2019; Manning et al., 2020) and word order (Yang et al., 2019) information, and dominate syntactic tasks in NLP such as constituency and dependency (He and Choi, 2019) parsing. On the other hand, on several linguistically-motivated tasks like English subject-verb agreement (Tran et al., 2018), recurrent models are reported to outperform Transformers. Our results help address the issue by confirming that distributed and recurrent sequence processing can both model hierarchical structure, albeit with different mechanisms and tradeoffs.

Dyck Languages
Consider the vocabulary of k types of open and close brackets Σ = ∪_{i∈[k]} {⟨_i, ⟩_i}, and define Dyck-k ⊂ γΣ*ω (γ, ω being special start and end tokens) to be the formal language of well-nested brackets of k types. It is generated starting from γXω through the following context-free grammar:

    X → ⟨_i X ⟩_i X   (for each i ∈ [k]),    X → ε

where ε denotes the empty string. Intuitively, Dyck-k can be recognized by sequential scanning with a stack (i.e., a pushdown automaton). Open brackets are pushed onto the stack, while a close bracket causes the stack to pop, and the popped open bracket is compared with the current close bracket (they should be of the same type). The depth of a string w_{1:n} at position i is the stack size after scanning w_{1:i}, that is, the number of open brackets left in the stack:

    d(w_{1:i}) = #{open brackets in w_{1:i}} − #{close brackets in w_{1:i}}.

Finally, we define Dyck-(k, D) to be the subset of Dyck-k strings with depth bounded by D:

    Dyck-(k, D) = {w_{1:n} ∈ Dyck-k : max_{i∈[n]} d(w_{1:i}) ≤ D}.

That is, a string in Dyck-(k, D) only requires a stack of bounded size D to process.
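For concreteness, the scanning procedure can be written out in a few lines of Python. This is a minimal reference sketch, with our own token encoding and the hypothetical helper name is_dyck_kD; it is not part of the paper's constructions.

```python
# Minimal reference recognizer for Dyck-(k, D); tokens are encoded as
# ("open", i) / ("close", i) for bracket type i. Illustration only.
def is_dyck_kD(tokens, k, D):
    stack = []
    for kind, typ in tokens:
        if kind == "open":
            stack.append(typ)
            if len(stack) > D:                   # depth bound D exceeded
                return False
        elif not stack or stack.pop() != typ:    # unmatched or wrong type
            return False
    return not stack                             # every bracket must be closed

# "( [ ] )" is in Dyck-(2, 2) but not in Dyck-(2, 1):
w = [("open", 0), ("open", 1), ("close", 1), ("close", 0)]
assert is_dyck_kD(w, k=2, D=2) and not is_dyck_kD(w, k=2, D=1)
```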

Self-attention Networks
We consider the encoder part of the original Transformer (Vaswani et al., 2017), which has multiple layers of two blocks each: (i) a self-attention block and (ii) a feed-forward network (FFN). For an input string w_{1:n} ∈ Σ*, each input token w_i is converted into a token embedding via f_e : Σ → R^{d_model}, and then a positional encoding p_i ∈ R^{d_model} is added. Let x_{i,ℓ} ∈ R^{d_model} be the representation of position i at layer ℓ (i ∈ [n], ℓ ∈ [L]).

Attention  In each head of a self-attention block, the input vectors x_{1:n} undergo linear transforms Q, K, V yielding query, key, and value vectors. They are taken as input to a self-attention module, whose i-th output is a vector a_i = Σ_{j∈[n]} α_j V x_j, where α_{1:n} = softmax(⟨Qx_i, Kx_1⟩, · · · , ⟨Qx_i, Kx_n⟩). The final attention output is the concatenation of the multi-head attention outputs. We also consider variants of this basic model along these directions: (i) hard attention, as opposed to the soft attention described above, where hardmax is used in place of softmax (i.e., a_i = V x_{j*} where j* = arg max_j ⟨Qx_i, Kx_j⟩); though impractical for NLP, it has been used to model formal languages (Hahn, 2020). (ii) Positional masking, where attention at position i is restricted to positions j ≤ i (future masking) or j ≥ i (past masking).
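For readers who prefer code, here is a minimal numpy sketch of one attention head in both its soft and hard variants. It is our own illustration (projections applied on the right, masking omitted), not the paper's implementation.

```python
import numpy as np

def attention_head(X, Q, K, V, hard=False):
    """One self-attention head over row-stacked inputs X (n x d).
    Q, K, V are d x d' matrices applied on the right; hard=True uses hardmax."""
    scores = (X @ Q) @ (X @ K).T                 # scores[i, j] = <q_i, k_j>
    if hard:
        j_star = scores.argmax(axis=1)           # each i attends to one argmax j
        return (X @ V)[j_star]
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)            # row-wise softmax weights alpha
    return w @ (X @ V)                           # a_i = sum_j alpha_j V x_j

X = np.random.randn(5, 4)
I = np.eye(4)
soft_out = attention_head(X, I, I, I)
hard_out = attention_head(X, I, I, I, hard=True)
```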
Feed-forward network  A feed-forward network F transforms each self-attention output vector a_i → F(a_i) individually. It is usually implemented as a multi-layer perceptron (MLP) with ReLU activations. Residual connections (He et al., 2016) and layer normalization (Ba et al., 2016) are two optional components that aid learning.

Positional encodings  Vaswani et al. (2017) propose two kinds of positional encoding: (i) Fourier features (Rahimi and Recht, 2007), i.e. sine/cosine values of different frequencies; (ii) learnable features for each position. In this work we propose to use a single scalar i/n to encode position i ∈ [n], and show that it helps process formal languages like Dyck-(k, D), both theoretically and empirically.
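A minimal sketch of the scalar scheme next to the standard sinusoidal features may help; the function names below are ours, for illustration.

```python
import numpy as np

def scalar_encoding(n):
    # The single-scalar scheme studied here: position i encoded as i/n,
    # a monotone value that makes order comparisons trivial.
    return np.arange(1, n + 1) / n               # shape (n,)

def sinusoidal_encoding(n, d_model):
    # Fixed Fourier features of Vaswani et al. (2017), for comparison.
    pos = np.arange(n)[:, None]
    dim = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (dim // 2)) / d_model)
    return np.where(dim % 2 == 0, np.sin(angles), np.cos(angles))
```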
Precision and memory size  We define precision to be the number of binary bits used to represent each scalar, and memory size per layer (d_model) to be the number of scalars used to represent each token at each layer. The memory size (L · d_model) is the total memory used for each token.

Language Generation and Recognition
For a Transformer with L layers and input w_{1:i}, we can use a decoder (MLP + softmax) on the final token output x_{i,L} to predict w_{i+1}. This defines a language model f_θ(w_{i+1} | w_{1:i}), where θ denotes the Transformer and decoder parameters. We follow previous work (Hewitt et al., 2020) in defining how a language model can generate a formal language (Definition 3.1; roughly, f_θ generates L if L is exactly the set of strings whose every next token receives sufficiently large probability under f_θ). We also consider language recognition by a language classifier g_θ(w_{1:i}), where a decoder on x_{i,L} instead predicts a binary label. Definition 3.2 (Language recognition). Language classifier g_θ over Σ recognizes a language L ⊆ Σ* if L = {w_{1:n} ∈ Σ* | g_θ(w_{1:n}) = 1}.
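As a small illustration of these two output modes, the following sketch assumes a trained network has already produced the final representations; the weights W, b, w are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Language-model head: a distribution over the vocabulary for w_{i+1},
# computed from the final representation x_{i,L}.
def lm_head(x_iL, W, b):
    return softmax(W @ x_iL + b)

# Recognition head: a binary label computed from x_{n,L}
# (1 = the whole string is judged to be in the language).
def recognizer_head(x_nL, w, b):
    return int(w @ x_nL + b > 0)
```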

Theoretical Results
In this section we state our theoretical results along with some remarks. Proof sketches are provided in the next section, and details in Appendices A, B, and C.

Theorem 4.1 (informal). There is a (D + 1)-layer hard-attention network with O(log k) memory size (per token per layer) that recognizes Dyck-(k, D).

Theorem 4.2 (informal). There is a two-layer soft-attention network with O(log k) memory size (per token per layer) that generates Dyck-(k, D).

Theorem 4.3 (informal). No hard-attention network with o(log n) precision can recognize Dyck-(k, D) on inputs of length n.

Required precision  Both constructions require a precision that increases with input length, as indicated by Theorem 4.3. The proof of the lower bound is inspired by the proof in Hahn (2020), but several technical improvements are necessary; see Appendix C. Intuitively, a vector with a fixed dimension and o(log n) precision cannot even represent n positions uniquely. The required precision is not unreasonable, since log n is a small overhead relative to the n tokens the system has to store.
Comparison to recurrent processing  Hewitt et al. (2020) construct a 1-layer RNN to generate Dyck-(k, D) with Θ(D log k) memory, and prove it is optimal for any recurrent network. Thus Theorem 4.2 establishes a memory advantage of self-attention networks over recurrent ones. However, this is based on two tradeoffs: (i) Precision. Hewitt et al. (2020) assume O(1) precision while we require O(log n). (ii) Runtime. The runtimes of recurrent and self-attention networks usually scale linearly and quadratically in n, respectively.
Comparison between two constructions  Theorem 4.2 requires fewer layers (2 vs. D + 1) and less memory size (O(log k) vs. O(D log k)) than Theorem 4.1, thanks to the use of soft attention, residual connections, and layer normalization. Though the two constructions are more suited to the tasks of recognition and generation, respectively (Section 5), each of them can also be modified for the other task.
Connection to Dyck-k  In Hahn (2020) it is shown that no hard-attention network can recognize Dyck-k, even for k = 1. Theorem 4.1 establishes that this impossibility can be circumvented by bounding the depth of the Dyck language. Hahn (2020) also points out that soft-attention networks can be limited due to bounded Lipschitz continuity. In fact, our Theorem 4.2 construction can also work on Dyck-k under some additional assumptions (e.g., feeding n into the input embeddings as well), and we circumvent the impossibility by using layer normalization, which may have an O(n) Lipschitz constant. More details are in Appendix B.4.

(D + 1)-layer Hard-Attention Network
Our insight underlying the construction in Theorem 4.1 is that, by recursively removing matched brackets from the innermost positions outward, each token only needs to attend to the nearest unmatched brackets to find its matching bracket or detect an error within D layers. Specifically, at each layer ℓ ≤ D, each token will be in one of three states (Figure 2(c)): (i) Matched, (ii) Error, (iii) Unmatched, and we leverage hard attention to implement a dynamic state-updating process that recognizes Dyck-(k, D).
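Before the formal construction, this mechanism can be previewed as a plain-Python simulation; it mimics what the left/right heads and the state update jointly compute, with start/end tokens omitted and the same token encoding as in the earlier recognizer sketch. It is a stand-in for the network, not the network itself.

```python
# Layerwise matching simulation: D synchronous rounds of "innermost
# cancellation". States are updated from the previous layer's states.
def simulate_layers(tokens, D):
    state = ["unmatched"] * len(tokens)
    for _ in range(D):
        new_state = list(state)
        for i, (kind, typ) in enumerate(tokens):
            if state[i] != "unmatched":
                continue
            # nearest unmatched neighbors, as found by the left/right heads
            left = next((j for j in range(i - 1, -1, -1)
                         if state[j] == "unmatched"), None)
            right = next((j for j in range(i + 1, len(tokens))
                          if state[j] == "unmatched"), None)
            j = right if kind == "open" else left
            if j is None:
                continue
            jk, jt = tokens[j]
            if kind != jk:  # one open and one close: match or type error
                new_state[i] = "matched" if typ == jt else "error"
        state = new_state
    return state

toks = [("open", 0), ("open", 1), ("close", 1), ("close", 0)]
print(simulate_layers(toks, D=2))  # all four tokens end up 'matched'
```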
Representation  For an input w_{1:n} ∈ γΣ*ω, the representation at position i of layer ℓ has five parts: (i) a bracket type embedding t_i ∈ R^{log k} denoting which bracket type (1 · · · k) the token is (or whether it is the start/end token); (ii) a bracket openness bit o_i ∈ {0, 1}, where 1 denotes an open bracket (or the start token) and 0 a close bracket (or the end token); (iii) a positional encoding scalar p_i = i/n; (iv) a match bit m_{i,ℓ} ∈ {0, 1}, where 1 denotes matched and 0 unmatched; (v) an error bit e_{i,ℓ} ∈ {0, 1}, where 1 denotes error and 0 no error. The token identity parts t_i, o_i, p_i are maintained unchanged throughout layers. The match and error bits are initialized as e_{i,0} = m_{i,0} = 0.
The first D layers have identical self-attention blocks and feed-forward networks, detailed below.
We have 3 attention heads: (i) an identity head Att_id, where each token only attends to itself, with attention output a^id_i = x_i; (ii) a left head Att_left with future positional masking; (iii) a right head Att_right with past positional masking. The query, key, and value vectors for Att_left are defined such that its output a^left_i = x_{j_1} is the representation of the nearest unmatched token j_1 < i on the left side of i. Similarly, the output of Att_right, a^right_i = x_{j_2}, is the representation of the nearest unmatched token j_2 > i on the right side of i. The attention output for position i is the concatenation of these three outputs: a_i = [a^id_i, a^left_i, a^right_i].

Feed-forward network (FFN)  Following the notation above, the feed-forward network F : a_i → y_i serves to update each position's state using information from x_{j_1}, x_{j_2}. The high-level logic (Figure 2(c)) is that if w_i is an open bracket, its potential matching half should be w_j = w_{j_2} (j_2 > i); otherwise it should be w_j = w_{j_1} (j_1 < i). If w_i and w_j are one open and one close, they either match (same type) or cause an error (different types). If w_i and w_j are both open or both close, no state update is done for position i. Besides, the token identity parts t_i, o_i, p_i are copied from a^id_i to pass on. This idea can be translated into a language of logical operations (∧, ∨, ¬) plus a SAME(t, t') operation, which returns 1 if the vectors satisfy t = t' and 0 otherwise. As we show in Appendix A, a multi-layer perceptron with ReLU activations can simulate all operations (∧, ∨, ¬, SAME), hence our desired FFN exists.
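That ReLU MLPs can simulate these gates is easy to verify concretely. Below is a minimal numpy sketch of our own (assuming 0/1-valued inputs), not the Appendix A construction itself.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# For a, b in {0, 1}: AND is a single ReLU unit, NOT an affine map.
def AND(a, b): return relu(a + b - 1.0)
def NOT(a):    return 1.0 - a

def SAME(t, u):
    # 1 iff binary vectors t and u agree in every coordinate:
    # per-coordinate XNOR, then a ReLU threshold on the sum.
    agree = AND(t, u) + AND(NOT(t), NOT(u))      # XNOR, coordinate-wise
    return relu(agree.sum() - (len(t) - 1.0))    # 1 iff all coordinates agree

t = np.array([1.0, 0.0, 1.0])
assert SAME(t, t) == 1.0 and SAME(t, np.array([1.0, 1.0, 1.0])) == 0.0
```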
Final layer  At the (D + 1)-th layer, the self-attention is designed so that the last token attends to any position whose state still indicates trouble (for instance, using keys e_i + (1 − m_i) and values (e_i, m_i)). If all brackets are matched without error ((e_i, m_i) = (0, 1)), all keys would be 0, and the attention output of the last token, a_n, would be (0, 1). If any bracket finds an error (e_i = 1) or is not matched (m_i = 0), its key would be at least 1 and a_n would not be (0, 1). An FFN that emulates (a, b) → ¬a ∧ b then delivers y_n as the recognition answer.

Two-layer Soft-Attention Network
Our Theorem 4.2 construction takes advantage of soft attention, residual connections, and layer normalization to calculate each token's depth and translate it into vector form at the first layer.

First Layer - Depth Counting  The first self-attention layer has two heads: an Att_id head is still used to inherit t_i, o_i, p_i, and a head Att_d with future positional masking aims to count depth, with Qx_i = Kx_i = 1 and V x_i = 2o_i − 1, resulting in uniform attention scores and attention output a^d_i = Σ_{j≤i} (1/i) · (2o_j − 1) = d(w_{1:i})/i. However, our goal is to enable matching based on the depth d_i = d(w_{1:i}), and the attention output d_i/i isn't readily usable for such a purpose: the denominator i is undesirable, and even a scalar d_i cannot easily attend to the same value using dot-product attention. Thus, in the first feed-forward network, we leverage the residual connection and layer normalization to transform d_i/i into a unit vector (cos θ(d_i), sin θ(d_i)), with a distinct angle θ(d) for every d ∈ {0, · · · , D + 1}, so that equal depths can be detected via inner products. The representation by the end of the first layer thus carries t_i, o_i, p_i together with this depth vector. The full detail for the first FFN is in Appendix B.1.
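As a sanity check on the first layer, a few lines of numpy reproduce the uniform-attention computation. This is a minimal sketch; the helper name depth_by_uniform_attention is ours, and start/end tokens are omitted.

```python
import numpy as np

def depth_by_uniform_attention(openness):
    """openness[j] = 1 for open brackets, 0 for close; returns d(w_{1:i})/i,
    the output of the uniformly-attending, future-masked head Att_d."""
    v = 2 * np.asarray(openness, dtype=float) - 1   # value V x_j = 2 o_j - 1
    i = np.arange(1, len(v) + 1)
    return np.cumsum(v) / i                         # a_i^d = d(w_{1:i}) / i

# "( ( ) (" has depths 1, 2, 1, 2; the head outputs depth / position:
print(depth_by_uniform_attention([1, 1, 0, 1]))     # [1.0, 1.0, 0.333..., 0.5]
```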
Second Layer - Depth Matching  The second self-attention layer has a depth-matching hard-attention head Att_match, whose query, key, and value vectors are designed so that position i attends to the most recent position j ≤ i holding an unclosed open bracket at depth d_i, i.e., the stack-top open bracket (or the start token when the stack is empty). With such an output [x_i, x_j], the second-layer FFN can readily predict what w_{i+1} could be. It could be any open bracket when d_i < D (i.e., cos(θ(d_i)) > cos(θ(D))), and it could be the close bracket with type t_j (or the end token if w_j is the start token). The detailed construction of such an FFN is in Appendix B.2.
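The behavior of Att_match can again be simulated in plain Python; the helper stack_top below is a hypothetical stand-in for the attention head, not the network.

```python
# Position i attends to the most recent open bracket at the same depth d_i,
# which is exactly the stack top after reading w_{1:i}.
def stack_top(openness, i):
    depth = [0]
    for o in openness[: i + 1]:
        depth.append(depth[-1] + (1 if o else -1))
    d_i = depth[i + 1]
    for j in range(i, -1, -1):
        if openness[j] and depth[j + 1] == d_i:   # open bracket reaching depth d_i
            return j
    return None                                    # empty stack: attend to start token

print(stack_top([1, 1, 0, 1], 3))  # after "( ( ) (", the stack top is index 3
print(stack_top([1, 1, 0, 1], 2))  # after "( ( )", the stack top is index 0
```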
On Dyck-k Generation  In fact, this theoretical construction can also generate Dyck-k, as intuitively the O(log n) precision assumption allows counting depth up to O(n). But it involves extra conditions, like feeding n into the network input, and may not be effectively learned in practice. Please refer to the details in Appendix B.4.

Connection to Empirical Findings
Our theoretical construction explains the observation in Ebrahimi et al. (2020) that the second layer of a two-layer Transformer trained on Dyck-k often produces virtually hard attention, where tokens attend to the stack-top open bracket (or the start token). It also explains why such a pattern is found less systematically as input depth increases, as (6) is hard to learn and generalize to unbounded depth in practice.

Experiments
Our constructions show the existence of self-attention networks that are capable of recognizing and generating Dyck-(k, D). We now bring these theoretical insights to experiments, and study whether such networks can be learned from finite samples and generalize to longer inputs. The answer is affirmative when the right positional encodings and memory sizes are chosen according to our theory.
We first present results on Dyck-(8, 10) (Section 6.1) as an example Dyck-(k, D) language to investigate the effect of different positional encoding schemes, numbers of layers, and hidden sizes on Transformer performance, and to compare with LSTM performance. We then extend the Transformer vs. LSTM comparison to more Dyck-(k, D) languages (k ∈ {2, 8, 32, 128}, D ∈ {3, 5, 10, 15}) in Section 6.2. Finally, we apply the novel scalar positional encoding to natural language modeling with some preliminary findings (Section 6.3).

Evaluation on Dyck-(8, 10)
Setup  For Dyck-(8, 10), we generate training and validation sets with input length n ≤ 700, and a test set with lengths 700 < n ≤ 1400. We train randomly initialized Transformers using the Huggingface library, with three positional encoding schemes: the scalar encoding i/n proposed above (POS/N), learnable encodings (LEARN), and fixed sine/cosine encodings (COS).

Results  On the validation set of Dyck-(8, 10) (see Appendix D.2), all three models achieve near-perfect accuracy with L ≥ 2 layers. On the test set (Figure 4(a)), however, only POS/N maintains near-perfect accuracy, even with L = 10 layers. Meanwhile, LEARN and COS fail to generalize, because encodings for positions 700 < i ≤ 1400 are not learned (for LEARN) or experienced (for COS) during training. The result validates our theoretical construction, and points to the need for separate and systematic positional encodings for processing long and order-sensitive sequences like Dyck-(k, D).

Memory Size and Comparison with LSTM
We compare a two-layer Transformer (POS/N) with a one-layer LSTM (Hochreiter and Schmidhuber, 1997) using varying per-layer memory sizes d_model ∈ {10, 20, · · · , 100}. As Figure 4(b) shows, the Transformer consistently outperforms the LSTM on the validation set. On the test set (Figure 4(c)), the Transformer and the LSTM first achieve > 90% accuracy using d_model = 20 and 40, respectively, and > 95% accuracy with d_model = 30 and 50, respectively. These findings agree with our theoretical characterization that self-attention networks have a memory advantage over recurrent ones.

Evaluation on More Dyck-(k, D) Languages
Setup  In order to generalize some of the above results, we generate a wide range of Dyck-(k, D) languages with different vocabulary sizes (k ∈ {2, 8, 32, 128}) and recursion bounds (D ∈ {3, 5, 10, 15}). We continue to compare the one-layer LSTM versus the two-layer Transformer (POS/N). For each model on each language, we perform a hyperparameter search over learning rates in {0.01, 0.001} and memory sizes d_model ∈ {10, 30, 50}, and report results from the best setting, based on two trials per setting.

Results
The validation and test accuracies of the models are reported in Figure 5.

Evaluation on WikiText-103
In Section 6.1, we showed that a Transformer with the scalar positional encoding scheme (POS/N) can learn Dyck-(k, D) and generalize to longer inputs, while traditional positional encoding schemes (COS, LEARN) lead to degraded test performance. To investigate whether such a novel scheme is also useful in NLP tasks, we train two RoBERTa models (POS/N, LEARN) from scratch on the WikiText-103 dataset (Merity et al., 2017) for 150 epochs. Figure 6 shows the masked language modeling loss on both training and validation sets. By the end of training, POS/N has a slightly larger validation loss (1.55) than LEARN (1.31). But throughout the optimization, POS/N shows a gradual decrease of loss, while LEARN has a sudden drop of loss around 20-30 epochs. We believe it will be interesting for future work to explore how POS/N performs on different downstream tasks, and why POS/N seems slightly worse than LEARN (at least on this MLM task), even though theoretically it provides complete positional information to Transformers. These topics will contribute to a deeper understanding of positional encodings and how Transformers leverage positional information to succeed on different tasks.

Discussion
In this paper, we theoretically and experimentally demonstrate that self-attention networks can process the bounded hierarchical languages Dyck-(k, D), even with a memory advantage over recurrent networks, despite performing distributed processing of sequences without explicit recursive elements. Our results may explain their widespread success at modeling long pieces of text with hierarchical structures and long-range, nested dependencies, including coreference, discourse, and narratives. We hope these insights can enhance knowledge about the nature of recurrence and parallelism in sequence processing, and lead to better NLP models.

B.1 First Layer FFN

... normalization, and the third step follows from the ReLU activation gate; the fourth step comes from the residual connection, and the last step can be obtained with an extra layer of MLP. We conclude the proof here.

B.2 Second Layer FFN
We can choose between the k open brackets and the matched close bracket, with the exception of a few boundary cases: (1) the depth of the current bracket reaches the maximum; (2) the length of the sequence is about to reach the maximum. Letting m_i be the bracket type of the matched bracket at position i, we implement the last layer as follows.
We elaborate on a few details here.
Output mechanism  The final output is determined by V y_{T+2}, where V ∈ R^{2k×(2 log k + 1)} satisfies V_{i,1} = 0 and V_{i,2:} is the binary encoding of the i-th close bracket and its complement when i ∈ {1, · · · , k}; V_{i,1} = log k and V_{i,j} = 0 when i ∈ {k + 1, · · · , 2k} and j > 1.

B.3 Extension to the Recognition Task
Our construction can be adapted to the recognition task with some extra effort.

B.4 Extension to Dyck-k
We can extend the above construction to recognize the language Dyck-k. Our construction bypasses the lower bound in Hahn (2020) because layer normalization can have an O(n) Lipschitz constant (see the discussion in Section 4). Due to space limits, we omit the detailed proof and only outline the major differences from the proof of Theorem 4.2.
1. We need the position encoding i/n^3 instead of i/n, and add an extra position encoding of n.
2. For the first FFN, we replace D with n. In particular, for Lemma B.1, we need an extra input of n/i; this can be derived with either an extra attention head or an extra position encoding.
3. For the second FFN, we make some adjustments to the input of the EQUAL gate, since the gap between two inputs could be very small, i.e., O(1/n^2). Nevertheless, we can use the same trick as in Lemma B.1 to amplify the gap between two inputs a, b to be of order Ω(1), which suffices for our purpose.

C Theoretical limits for finite position encoding
We prove that a Transformer with finite precision cannot recognize the Dyck-(k, D) language. In fact, we show a stronger result: no Transformer with o(log n) precision can recognize Dyck-(k, D) on inputs of length n.
Theorem C.1 (Formal statement of Theorem 4.3). For any k ∈ N, using hard attention, no Transformer with o(log n) encoding precision can recognize Dyck-(k, 2) with input length n.
Our proof is inspired by Hahn (2020) but has several different technical ingredients: (1) we allow arbitrary attention masking (both future and past position masking); (2) we allow arbitrary position encodings; (3) our lower bound holds for the bounded-depth language Dyck-(k, D); (4) we provide a quantitative bound on the precision in terms of the input length n. In general, our lower bound is incomparable with that of Hahn (2020): we prove a fine-grained bound on the precision requirement for the bounded-depth language Dyck-(k, D), while the proof in Hahn (2020) applies only to languages with depth Ω(n) but allows arbitrary precision in the position encodings.
The high-level intuition behind our proof is that the attention heads can only catch o(n) input positions when we properly fix a small number of symbols in the input sequence. This limits the capability of a Transformer and makes it fail to recognize the Dyck-(k, D) language.
We consider an L-layer Transformer and assume 3H attention heads in total: H normal attention heads, H attention heads with future position masking, and H attention heads with past position masking. To make our hardness result general, we allow residual connections for the attention layer, and we assume the FFN can be an arbitrary function defined on the attention outcome. In the proof, we gradually fix o(n) positions of the input sequence. We only perform the following two kinds of assignments: (1) we assign matching brackets to positions i, i + 1, where i is odd; (2) we assign nested matching brackets (e.g., '[', '(', ')', ']') to positions i, i + 1, i + 2, i + 3 for odd i. A partial assignment to the input sequence is said to be well-aligned if it follows these two rules. Throughout the proof, we guarantee that for any i ∈ [n], ℓ ∈ [L], the output of the ℓ-th layer x_{i,ℓ} depends only on the input symbol at position i. This is clearly satisfied for ℓ = 0, given that it is composed of the position embedding and word embedding only. We gradually fix the input and conduct induction on ℓ. We use c_ℓ to denote the number of positions we have fixed before the ℓ-th layer, and s_ℓ to denote the number of consecutive assigned blocks of the input sequence. It is clear that s_ℓ ≤ 2c_ℓ. The following lemma is key to our analysis. Due to space limits, we omit the detailed proof.
Lemma C.2. For any ℓ ∈ {1, · · · , L}, given a well-aligned partially assigned input sequence, suppose the input of the ℓ-th layer x_{i,ℓ−1} depends on the symbol at position i only. Then by fixing c_ℓ H^2 (k + 1)^{O(ℓH)} 2^{O(ℓHp)} additional positions of the input sequence, we can guarantee that the output of the ℓ-th layer x_{i,ℓ} also depends solely on the symbol at position i.
Proof of Theorem C.1. We apply Lemma C.2 and compute the number of positions c_{L+1} we need to restrict in order to guarantee that the output of the L-th layer x_{i,L+1} depends only on the input at position i. We know the partially assigned sequence is well-aligned, has depth at most two, and the number of assigned positions is only 0.01n. Thus, we assert that when p = o(log n), the output of the Transformer is completely determined by the partial assignment; it cannot detect whether there is an error in the unassigned positions, and thus cannot recognize the Dyck-(k, 2) language. We conclude the proof here.

D.1 Experiment Details

Dataset  ... (Table 1) for an O(D^2) hitting time of different DFA states. The numbers of tokens in the train, validation, and test sets are 2 × 10^6, 2 × 10^5, and 10^6, respectively.

Models
We use the LSTM model implemented in Hewitt et al. (2020). For Transformer models, we turn off all dropout, as we find it hurts performance greatly. We also use only 1 attention head, as we find more heads to hurt performance. We use the Adam optimizer with an initial learning rate of 0.01 or 0.001, choosing the better learning rate in terms of validation accuracy for each experiment.
We train for at most 100 epochs but allow early stopping if the validation loss converges.
Metric  We follow Hewitt et al. (2020) and use the accuracy of correct close bracket predictions: let p_l be the empirical probability that the model confidently predicts the correct close bracket (defined as p(⟩_j | w_{1:i}) > 0.8), conditioned on it being separated from its open bracket by l tokens. Unlike Hewitt et al. (2020), where mean_l p_l is reported, we report E_l p_l, for two reasons: (i) when l is large, p_l might be defined by only one trial, so mean_l p_l amplifies the randomness; (ii) the findings remain similar with either metric.
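A sketch of this metric computation, under our reading of the definitions above; the names records and bracket_accuracy are hypothetical.

```python
import numpy as np
from collections import defaultdict

def bracket_accuracy(records):
    """records: list of (l, p) pairs, where l is the open-close distance and p
    is the model's probability on the correct close bracket at that position."""
    by_l = defaultdict(list)
    for l, p in records:
        by_l[l].append(p > 0.8)                          # "confident" prediction
    p_l = {l: np.mean(v) for l, v in by_l.items()}
    mean_l = float(np.mean(list(p_l.values())))          # unweighted mean_l p_l
    e_l = float(np.mean([p > 0.8 for _, p in records]))  # occurrence-weighted E_l p_l
    return mean_l, e_l
```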

D.2 More Results
In Figure 7, we show the validation performance of Transformers with different positional encoding schemes. They all reach near-perfect accuracy when having at least 2 layers. In Figure 8, we break down the results of Section 6.2 for d_model ∈ {10, 30, 50}. We also add results for a five-layer Transformer, which performs similarly to the two-layer Transformer. This shows that (i) a two-layer Transformer, as suggested by our theory, is enough to process Dyck-(k, D), and (ii) Transformers with more layers can also learn to process Dyck-(k, D) without overfitting or degraded performance.