Transformer Working Memory Enables Regular Language Reasoning and Natural Language Length Extrapolation

Conventional wisdom holds that, unlike recurrent models, Transformers cannot perfectly model regular languages. Inspired by the notion of working memory, we propose a new Transformer variant named RegularGPT. With its novel combination of Weight-Sharing, Adaptive-Depth, and Sliding-Dilated-Attention, RegularGPT constructs working memory along the depth dimension, thereby enabling efficient and successful modeling of regular languages such as PARITY. We further test RegularGPT on the task of natural language length extrapolation and surprisingly find that it rediscovers the local windowed attention effect deemed necessary in prior work for length extrapolation.


Introduction
It has long been believed that Working Memory (WM), a term coined in the 1960s to liken human minds to computers, plays an important role in humans' reasoning ability and the guidance of decision-making behavior (Baddeley and Hitch, 1974; Baddeley, 1992; Ericsson and Kintsch, 1995; Cowan, 1998; Miyake et al., 1999; Oberauer, 2002; Diamond, 2013; Adams et al., 2018). While no single definition encompasses all applications of WM (Adams et al., 2018), the following one should be shared by all of the theories of interest: "Working memory is a system of components that holds a limited amount of information temporarily in a heightened state of availability for use in ongoing processing." (Adams et al., 2018). WM is instantiated in the two major driving forces of sequence modeling: recurrent neural networks' (RNNs) (Elman, 1990; Jordan, 1997; Hochreiter and Schmidhuber, 1997) short-term memory modulated by their recurrent nature and gate design (Rae and Razavi, 2020a; Nematzadeh et al., 2020; Armeni et al., 2022), and Transformers' (Vaswani et al., 2017) salient tokens heightened by self-attention.
In reality, self-attention often attends broadly (Clark et al., 2019), violating the limited-amount-of-information notion of WM. Our hypothesis is that this violation is to blame for Transformers' failure at algorithmic reasoning on regular languages (Deletang et al., 2023; Liu et al., 2023) such as PARITY, a seemingly simple task that checks whether the number of 1s in a bit string is even. Surprisingly, a Transformer can only count the number of 1s correctly when the sequence length is held fixed at the training sequence length T_tr, and it fails miserably when the testing sequence length extrapolates to T_ex > T_tr (Hahn, 2020; Bhattamishra et al., 2020; Chiang and Cholak, 2022; Deletang et al., 2023; Liu et al., 2023). In contrast, an RNN can extrapolate perfectly.
The goal of this work is therefore to improve Transformers' WM by limiting the amount of accessible information at a time. Existing attempts that use a combination of scratchpad and recency biases (Wei et al., 2022; Nye et al., 2022; Anil et al., 2022; Liu et al., 2023) are not optimal, as they completely forego the parallelization property of a Transformer, making it as computationally inefficient as an RNN.
This raises the question: does there exist a more efficient Transformer working memory design? The answer is affirmative thanks to the proposed RegularGPT, which boils down to three design choices: Weight-Sharing, Adaptive-Depth, and Sliding-Dilated-Attention. Each of them has been proposed previously, but it is the unique combination that sparks the successful and efficient learning of regular languages. We will further demonstrate its: 1) similar recursive parallel structure to a linear RNN (Orvieto et al., 2023), resulting in log T_tr or log T_ex layers, and 2) generalizability, shown by strong performance on the task of Transformer natural language length extrapolation (Press et al., 2022; Chi et al., 2022, 2023).
In this work, we use [N] to denote the list of non-negative integers [0, . . . , N − 1]. The Transformer model used in this work is always causal. It takes in an input sequence of T ≤ T_tr units (tokens or bits) σ_{i∈[T]}, passes them through a fixed number of L Transformer layers, and finally computes the distribution over the vocabulary V via the prediction head W_o.

Regular Language and Algorithmic Reasoning
The Chomsky hierarchy (Chomsky, 1956b) classifies formal languages into different levels based on their increasing complexity. Each level represents a family of formal languages that can be solved by the corresponding automaton. At the lowest level resides the family of regular languages, which can be expressed using a finite state automaton (FSA), a computational model comprising a set of states and transitions connecting them. Our primary objective is to enhance the algorithmic reasoning of the Transformer model on regular languages by testing its language transduction capability under the extrapolation setting. Concretely, the model is trained to predict the desired outputs only on a set of short length-T sequences with T ≤ T_tr. Still, it must also predict the correct outputs for longer testing sequences of length T_ex ≫ T_tr. It is worth noting that we evaluate our model via language transduction following recent work (Deletang et al., 2023; Liu et al., 2023), instead of the conventional language recognition protocol. Both settings are equally hard as they are underpinned by the same finite state semiautomaton. Interested readers may refer to Deletang et al. (2023) for further details regarding the two evaluation protocols. We also reveal the connection between RegularGPT and finite state semiautomata later in §7.
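To make the transduction setting concrete, the sketch below (our own illustration in Python; the function names are ours, not the benchmark's) runs a two-state semiautomaton for PARITY and emits the state after every bit, which is exactly the quantity an FSA or an RNN can compute for arbitrarily long inputs:

# A two-state semiautomaton for PARITY with states {0: even, 1: odd}.
# delta[state][symbol] is the next state after reading a symbol in {0, 1}.
delta = [[0, 1],   # from "even": reading 0 stays even, reading 1 flips to odd
         [1, 0]]   # from "odd":  reading 0 stays odd,  reading 1 flips to even

def transduce(bits, q0=0):
    """Return the state visited after each bit (the transduction targets)."""
    q, outputs = q0, []
    for b in bits:
        q = delta[q][b]
        outputs.append(q)
    return outputs

print(transduce([1, 0, 1, 1]))  # [1, 1, 0, 1]; the last entry is the final parity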

Failure Mode and An Inefficient Fix
The PARITY task involves a length-T bit string σ_1 σ_2 ⋯ σ_T where each bit σ_i is randomly sampled from a Bernoulli distribution with P(σ_i = 1) = 0.5. The goal is to determine whether the sequence contains an even or odd number of 1s.
It has been observed that a Transformer is incapable of length extrapolation on PARITY, but what could be its potential failure mode? Previous work sheds light on this by showing that a Transformer might settle on a naive-summation approach (Anil et al., 2022; Deletang et al., 2023; Liu et al., 2023). Concretely, it sums up all the bits and outputs the summation modulo 2. This approach fails when the model takes sequences of length T_ex > T_tr as input or when P(σ_i = 1) deviates from 0.5, since unseen summations will be produced.
To the best of our knowledge, the existing remedy (Liu et al., 2023; Anil et al., 2022) is to use a scratchpad (Wei et al., 2022; Nye et al., 2022) along with recency biases (Press et al., 2022) to enforce the correct learning: they create a scratchpad that interleaves the sequence of input bits and intermediate answers (σ_1, q_1, σ_2, q_2, ⋯, σ_T, q_T), where q_i is the intermediate answer (the parity of the first i bits), and the model is trained to predict all the q_{i∈[T]}. Recency biases play the role of limiting a Transformer's receptive field to only a few most recent σ and q at every timestep i. This is to prevent self-attention from ignoring q and settling on the same naive-summation solution.
The scratchpad and recency biases jointly create the notion of WM along the temporal dimension similar to RNNs, thereby enabling successful extrapolation on regular languages. Nevertheless, we note that this fix is inefficient during inference since all the intermediate answers q_i have to be generated sequentially before reaching the final answer q_T. A desirable fix should only take in the input bits (σ_1, σ_2, ⋯, σ_T) and directly generate the final answer q_T. In other words, our goal is to find an efficient WM design for a Transformer.

A Desirable Fix for PARITY (Figure 1)
An alternative solution to the PARITY problem follows the spirit of divide-and-conquer: we first divide the sequence into T/C chunks, each of length C < T, and we compose the final answer by recursively merging the chunk outputs. This approach does not suffer from the unseen-summation issue because the model is trained to handle a fixed number of C bits at a time in its WM (chunk). It then recursively applies the already-seen results to compose the final solution when it encounters longer sequences during inference. More importantly, it is more efficient than the scratchpad-and-recency-biases approach since it only requires log_C T layers of parallel computation instead of 2T steps of sequential decoding.
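As a minimal sketch (our own illustration, with the chunk size C kept as an explicit parameter), the divide-and-conquer recursion for PARITY can be written as:

def parity_divide_and_conquer(bits, C=2):
    # Base case: at most C bits are handled "in working memory" at once.
    if len(bits) <= C:
        return sum(bits) % 2
    # Divide: split into chunks of length <= C and solve each chunk separately.
    chunk_results = [sum(bits[i:i + C]) % 2 for i in range(0, len(bits), C)]
    # Conquer: merging chunk parities is itself a smaller PARITY instance,
    # so only already-seen length-<=C computations are reused at every level.
    return parity_divide_and_conquer(chunk_results, C)

assert parity_divide_and_conquer([1, 0, 1, 1, 1, 0, 0, 1]) == 1  # five 1s: odd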

Proposed Architecture of RegularGPT
We present our modifications to the vanilla Transformer below. Only the related operations are expanded; we follow all other details of GPT-2 (Radford et al., 2019).

Sliding-Dilated-Attention
A Transformer layer at layer l consists of a self-attention operation denoted as SA^(l) and a feed-forward network denoted as FFN^(l). Originally, SA^(l) computes the inter-token relationships across all T units. Instead, we set the chunk size to C and produce T/C non-overlapping chunks; only the units within the same chunk inter-attend with each other. In practice, this is achieved by an attention mask M^(l) ∈ R^{T×T} at layer l. M^(l) shares the same shape as the self-attention matrix (see Figure 1) and is defined in Eq. (1). Note that M^(l) is a lower triangular matrix due to the causal nature of our model. The r_i's with i ∈ [C] are learnable relative positional scalars. To be precise, each attention head has a different set of learnable biases r_i; here, we drop the dependency on the head for notational simplicity. The use of the r_i's is similar to the positional scalars of T5 (Raffel et al., 2020) except that we do not use the log-binning strategy over m − n. This is to facilitate the extraction of global information instead of enforcing the windowed-attention effect (Raffel et al., 2020; Press et al., 2022; Chi et al., 2022, 2023). M^(l) is then added to the original self-attention matrix, creating the proposed Sliding-Dilated-Attention effect. The output of SA^(l) is transformed by the position-independent FFN^(l) to produce the layer output o^(l). The case of C = 2 is used as a possible construction of Theorem 1 in Liu et al. (2023); however, their focus is not on length extrapolation, hence lacking the two modifications proposed below.
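Since the exact form of Eq. (1) is best read alongside Figure 1, the following sketch shows one plausible way to materialize such a mask; it assumes a dilation of C^l at layer l and C learnable scalars r_j per head, which is our reading of the construction rather than the paper's reference implementation:

import numpy as np

def sliding_dilated_mask(T, C, layer, r):
    """Additive attention mask M^(l); r is a length-C vector of learnable scalars."""
    dilation = C ** layer                  # assumed dilation schedule: C^l at layer l
    M = np.full((T, T), -np.inf)
    for m in range(T):
        for j in range(C):                 # each query attends to at most C keys
            n = m - j * dilation
            if n >= 0:                     # causal: only positions in the past
                M[m, n] = r[j]
    return M

# Layer 0 of a C = 2 model: every position attends to itself and its predecessor.
print(sliding_dilated_mask(4, C=2, layer=0, r=np.zeros(2)))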

Adaptive-Depth and Weight-Sharing
Since our Sliding-Dilated-Attention limits the number of accessible tokens at a time, we need an adaptive depth L = ⌈log_C T⌉ so that the final output can utilize every single piece of input information. However, when T_ex > T_tr, the depth during inference will be higher than that during training. The simplest way to address this challenge without further parameter updates is to perform Weight-Sharing across layers. To account for the possible performance loss due to Weight-Sharing, we first thicken the model by K times, resulting in a total of K · L layers. Next, we share the weights across the K · L layers so that, for k ∈ [K], the k-th sub-layer of every thickened block reuses the same parameters. This can be equivalently interpreted as stacking more SA and FFN components within every Transformer layer, with the same thickened layer reused L times. This layer-thickening design is only used in the natural language modeling experiments in §6.
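The following sketch illustrates the Adaptive-Depth plus Weight-Sharing idea; the helper names and the callable shared_block are our placeholders, not the actual model code:

import math

def adaptive_depth(T, C):
    # The depth grows logarithmically with the input length, so longer test
    # sequences simply reuse the shared layers more times, with no new weights.
    return max(1, math.ceil(math.log(T, C)))

def run_shared_stack(hidden_states, shared_block, C):
    # shared_block stands for the thickened group of K (SA, FFN) sub-layers;
    # the very same parameters are applied at every level of the tree.
    for _ in range(adaptive_depth(len(hidden_states), C)):
        hidden_states = shared_block(hidden_states)
    return hidden_states

print(adaptive_depth(40, 2), adaptive_depth(500, 2))  # 6 levels at train, 9 at test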

Where is the WM Notion?
Instead of instantiating WM along the temporal dimension as the combination of scratchpad and recency biases does, RegularGPT limits the amount of information along the depth dimension. As we have seen, the idea of breaking T units into several chunks limits the amount of accessible information at each layer, thereby enabling the WM notion. A similar argument was made by Yogatama et al. (2021) in that they categorized Longformer (Beltagy et al., 2020), a Transformer variant with a local attention pattern, as a model of working memory. Finally, thanks to modern accelerators such as GPUs, all chunks at a layer can be processed concurrently, which further makes RegularGPT more favorable than the scratchpad-and-recency-biases approach.

Figure 2: The parallel scan algorithm that can accelerate a linear RNN. In this example, we visualize the routing path for computing x_8. Blocks at the same layer can be computed in parallel on GPUs.

Complexity Analysis

Connection to Prior Work
Sliding-Dilated-Attention This special attention pattern dates back to the pre-Transformer era, e.g., WaveNet (van den Oord et al., 2016) with dilated convolution. It can also be viewed as a special form of the Longformer attention pattern with systematic dilation (Beltagy et al., 2020) (the original Longformer also adopts dilated attention on a few heads at higher layers, but without the systematic pattern). Limiting the range of attention in lower layers of a Transformer is also corroborated by Rae and Razavi (2020b), who find that such a design does not deteriorate performance.
Adaptive-Depth and Weight-Sharing ALBERT (Lan et al., 2020) and Universal Transformer (Dehghani et al., 2019) share parameters across layers. The weight-sharing design makes them compatible with the ideas of Adaptive Computation Time (Graves et al., 2014) and Dynamic Halting (Dehghani et al., 2019; Elbayad et al., 2020), which allocate different computational budgets depending on the complexity of the task (Simoulin and Crabbé, 2021; Csordás et al., 2022). However, they lack the special Sliding-Dilated-Attention design that is necessary for ruling out naive solutions.

Linear RNN and Parallel Scan The hidden state x_k of a linear RNN (Orvieto et al., 2023) for k ∈ [T] can be written as x_k = Σ_{j=0}^{k} A^j v_{k−j}, where we set v_{k−j} = B u_{k−j}. The operation can be accelerated by the parallel scan algorithm that permits efficient cumulative sums (Ladner and Fischer, 1980; Blelloch, 1990; Lakshmivarahan and Dhall, 1994; Martin and Cundy, 2018; Liu et al., 2023; Smith et al., 2023). As we can see in Figure 2, the routing path specified by the parallel scan algorithm is the same as our Sliding-Dilated-Attention illustrated in Figure 1.
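To make the correspondence concrete, here is a small sketch of the parallel scan applied to a linear recurrence x_k = A x_{k−1} + v_k; we use a scalar A for readability, whereas the actual linear RNN formulations operate on vectors and matrices:

import numpy as np

def linear_rnn_parallel_scan(A, v):
    # Each step is the affine map x -> a*x + b with (a, b) = (A, v_k); composing
    # (a_l, b_l) then (a_r, b_r) gives (a_r*a_l, a_r*b_l + b_r), which is associative.
    a = np.full(len(v), A, dtype=float)
    b = np.asarray(v, dtype=float).copy()
    shift = 1
    while shift < len(v):                  # ceil(log2 T) levels, as in Figure 2
        a_left = np.concatenate([np.ones(shift), a[:-shift]])   # identity padding
        b_left = np.concatenate([np.zeros(shift), b[:-shift]])
        a, b = a * a_left, a * b_left + b  # all positions combined in parallel
        shift *= 2
    return b                               # b_k equals x_k since x_{-1} = 0

print(linear_rnn_parallel_scan(0.5, [1.0, 1.0, 1.0]))  # [1.0, 1.5, 1.75]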

Task Descriptions
We focus on the four tasks in the first section of Table 1, taken from Deletang et al. (2023), as they will also be used in our analysis in §7. For the tasks in the second section, please refer to Bhattamishra et al. (2020) for details.
Even Pairs A model needs to predict whether the total count of "ab" and "ba" pairs is even. In the example of "aabba", there is one "ab" and one "ba", resulting in a total count of 2, which is even. This task is equivalent to checking whether the first and last characters in a string are identical.

Modular Arithmetic Given a sequence of numbers in {0, 1, 2, 3, 4} and operations in {+, −, ·}, a model needs to compute the result modulo 5. For example, x = 1 + 2 − 4 evaluates to y = 4.
Parity Check A model needs to compute whether the number of b's in a given binary string is even. For example, the sequence x = aaabba contains 2 b's, which is even.
Cycle Navigation Given a sequence of movements on a cycle of length 5, a model needs to compute the end position. The possible movements are STAY, INCREASE, and DECREASE, encoded as {0, 1, 2}. The agent always starts at position 0. For example, 010211 means the agent stops at position 2 = (0 + 1 + 0 − 1 + 1 + 1) mod 5.
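For reference, the targets of three of these tasks can be computed with the small helper functions below (our own utilities, not the benchmark code; Modular Arithmetic additionally requires parsing the operator sequence and is omitted):

def even_pairs(s):                 # s is a string over {"a", "b"}
    return s[0] == s[-1]           # equivalent to an even count of "ab"/"ba" pairs

def parity_check(s):
    return s.count("b") % 2 == 0

def cycle_navigation(moves):                # moves over {0: STAY, 1: INCREASE, 2: DECREASE}
    step = {0: 0, 1: 1, 2: -1}
    return sum(step[m] for m in moves) % 5  # the agent starts at position 0

assert even_pairs("aabba")
assert parity_check("aaabba")
assert cycle_navigation([0, 1, 0, 2, 1, 1]) == 2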

Language Transduction and Extrapolation
First, we want to know whether endowing a Transformer with the notion of WM really improves its length extrapolation capability on regular languages. We test RegularGPT and all the baselines on two sets of regular languages from prior work (Deletang et al., 2023; Bhattamishra et al., 2020); our implementation is based on the codebase of Deletang et al. (2023) at https://github.com/deepmind/neural_networks_chomsky_hierarchy, and we additionally implement the regular languages in the second section of Table 1. Prior work often reports the maximum score across different hyperparameter settings and random seeds because their goal is to know whether a model can extrapolate at all. We additionally report the average scores since we want to know whether the model can consistently obtain good performance. The baseline models we compare against are an RNN and a vanilla Transformer with Transformer-XL-style relative positional embedding (Dai et al., 2019). Table 1 shows that RegularGPT with C = 2 achieves similar performance to an RNN and substantially outperforms a vanilla Transformer.

The Effect of Chunk Size C
We vary the chunk size C of RegularGPT to see its impact on performance. The motivation for using a larger C is to reduce the number of layers (i.e., L = ⌈log_C T⌉ decreases in C) and increase the degree of parallelization. However, in Table 1, a larger C seems to pose a challenge to RegularGPT on the Modular Arithmetic task. Modular Arithmetic is a hard task with far more states and complicated state transitions, and increasing C is likely to increase the task difficulty by composing more state transitions at once. We give an in-depth discussion of the theoretical reasons in §7.

Robust to Probability Changes
Other than the length extrapolation experiment, we alter the probability of sampling 1s in PARITY, i.e., we set P(σ_i = 1) ≠ 0.5. The results in Table 2 show that RegularGPT is robust to different sampling probabilities, indicating its successful modeling of the underlying regular language grammar. In contrast, a vanilla Transformer struggles to achieve good performance even in the same-length setting, again validating the fact that it only finds the naive-summation solution discussed in §2.2.

Natural Language Experiments
Given that RegularGPT has been battle-tested in the main experiments on regular languages, we now shift gears to benchmark its performance in the natural language scenario. Given a model trained on sequences of length T_tr, we test it on much longer sequences of length T_ex ≫ T_tr during inference, and the goal is to observe similar perplexities. To optimize efficiency, we employ a random selection process to extract 1,000 chunks, each with T_ex tokens, from the testing set. Subsequently, we calculate the average perplexity of the last token within each chunk to ensure that each of them has T_ex − 1 tokens as context, thereby avoiding the issue of the early token curse (Press et al., 2022; Chi et al., 2023). We compare our model against existing methods that are known to demonstrate the ability of length extrapolation, including T5 (Raffel et al., 2020), ALiBi (Press et al., 2022), and KERPLE (Chi et al., 2022); we use the nanoGPT codebase (https://github.com/karpathy/nanoGPT) and the OpenWebText2 dataset (https://huggingface.co/datasets/the_pile_openwebtext2). To counteract the loss of expressive power due to weight sharing, we thicken each layer of RegularGPT by K times as detailed in §3.
In Table 3, we first observe exploding perplexities for C = 32 once T_ex ≥ 2048. RegularGPT might only learn to model ⌈log_32 512⌉ = 2 layers during training, hence it fails to recursively model more than 32^2 = 1024 tokens during inference. This is validated by C = 64, since this time it is able to extrapolate until 64^⌈log_64 512⌉ = 4096. While the above argument seems to suggest a large C, setting C = 256 also deteriorates the performance. This might be due to the limited number of chunks (512/256 = 2) and r_i's (in Eq. (1)) observed at the second layer, making the learning of the r_i's harder. Overall, C is a hyperparameter that needs to be carefully chosen for RegularGPT on natural languages. We also observe that 128/12 performs better than 128/6, implying that RegularGPT's performance can be improved by stacking more layers to counteract the performance loss due to Weight-Sharing.
It is worth noting that 128/12 performs relatively well and is close to previous methods designed specifically for the task of natural language extrapolation. We analyze its inner workings in depth in Figure 4 and §7, where we find that RegularGPT learns a similar local receptive field to prior work, which is likely the key to its successful natural language extrapolation performance.

Regular Language and Finite State Semiautomaton
A regular language is the type of formal language recognized by an FSA (Chomsky, 1956a), which is a 5-tuple (Q, Σ, δ, q_0, F), where Q is a finite non-empty set of states, Σ is a finite non-empty set of symbols, δ : Q × Σ → Q is the transition function, q_0 ∈ Q is the initial state, and F ⊆ Q is the set of accepting states. However, some of our tasks are better modeled by a finite-state transducer (FST) as discussed in §2.1. To underpin both FSA and FST, we consider a semiautomaton A = (Q, Σ, δ) (i.e., an FSA without q_0 and F) and establish its connection to a Transformer model. Let σ_{a:b} be the sequence from position a (inclusive) to b (exclusive) out of a length-T input sequence (i.e., 0 ≤ a < b ≤ T). We define A(σ_{a:b}) : Q → Q as the (b − a)-step state transition relation after receiving σ_{a:b}.

Table 3: Natural language extrapolation results on OpenWebText2. The training length is 512. The numbers are averaged over three random seeds. Please refer to Appendix B for the detailed hyperparameters.

Modeling Transition Composition
We want to show that the layers of RegularGPT with chunk size C = 2 can model the composition of two transition functions, i.e., A(σ_{a:b}) = A(σ_{i:b}) ∘ A(σ_{a:i}) for any a < i < b. This way, the regular language problem can be solved recursively using the construction outlined in §3 and Figure 1. To formalize the statement, we first observe that A(σ_{a:b}), A(σ_{a:i}), and A(σ_{i:b}) can each be represented in R^{|Q|^2} as the flattening of a binary matrix whose q-th column is the one-hot encoding of the next state:

A(σ_{a:b}) = Σ_{q∈Q} OneHot_{|Q|}(A(σ_{a:b})(q)) · OneHot_{|Q|}(q)^T,   (2)

where OneHot_{|Q|}(i) is a one-hot vector of length |Q| with the i-th index being 1.
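As a concrete example of this representation (our own sketch, using PARITY with |Q| = 2), composing two transition functions amounts to multiplying their binary matrices:

import numpy as np

def transition_matrix(bits):
    # Column q holds OneHot(A(bits)(q)), the state reached from q after `bits`.
    flip = np.array([[0, 1], [1, 0]])      # reading a 1 swaps even/odd
    stay = np.eye(2, dtype=int)            # reading a 0 keeps the state
    M = np.eye(2, dtype=int)
    for b in bits:
        M = (flip if b else stay) @ M      # later symbols compose on the left
    return M

left, right = [1, 0], [1, 1]
composed = transition_matrix(right) @ transition_matrix(left)
assert (composed == transition_matrix(left + right)).all()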
The next step is to mix A(σ_{a:i}) and A(σ_{i:b}) together to get A(σ_{a:b}). We show in Lemma 1 that a 2-layer ReLU network (and hence a Transformer layer) can learn the composition. The proof of Lemma 1 is deferred to Appendix C.
Lemma 1 (Approximation for Binary Matrix Product). Let A, B ∈ {0, 1}^{n×n} be binary matrices of dimension n × n. Then, there exists a two-layer ReLU network f_mlp such that f_mlp([Flat(A), Flat(B)]) = Flat(AB), where Flat(X)_{(i−1)n+j} = X_{i,j} for i, j ∈ [1, . . . , n] is the operation that flattens a matrix into a vector.

Now, we can relate Lemma 1 to the FFN layers in RegularGPT. Following §3, when the chunk size is C = 2 and the thickness is K = 1, the output vector o^(l)_i at layer l is computed from the two lower-layer outputs o^(l−1)_{i−2^l} and o^(l−1)_i, which depend on the input sequences σ_{i−2^{l+1}+1:i−2^l+1} and σ_{i−2^l+1:i+1}, respectively. This observation implies that o^(l)_i likely models the transition function A(σ_{i−2^{l+1}+1:i+1}). If this is true, Lemma 1 implies that RegularGPT's FFN models the transition function composition: this is immediate by setting [Flat(A), Flat(B)] to the concatenation of the two lower-layer transition representations and recognizing the fact that function composition is a matrix product under the representation of Eq. (2).
The next step is to explain the use of the self-attention layers in RegularGPT. Although Lemma 1 has established the composition, it is unclear how the transitions are concatenated in the first place (i.e., how [Flat(A), Flat(B)] is formed). With two-head self-attention and the learnable relative positional scalars, it is possible to adjust them so that the attention output contains the concatenated information [Flat(A), Flat(B)].
Recall that in Eq. (1), each head has a different set of scalars r_i. One concrete construction for concatenation is setting r_0 = 0 and the remaining scalars to −∞ for the first head, and r_1 = 0 and the remaining scalars to −∞ for the second head. In other words, each head is only responsible for capturing one state transition. After the multi-head self-attention operation, we obtain the concatenation of the two state transitions.
Finally, when the prediction head reads out the answer, the operation is equivalent to a mapping from A(σ_{0:T}) ∈ R^{|Q|×|Q|} to A_{q_0}(σ_{0:T}) = A(σ_{0:T}) · q_0 ∈ R^{|Q|}. Since we assume that o^(L)_{T−1} models A(σ_{0:T}), this mapping can be realized by the prediction head on top of o^(L)_{T−1}. Our tree-structured construction also guarantees that the final answer can be derived using ⌈log_2 T⌉ layers.

Verification of Transition Modeling
To verify whether our model learns the dynamics of a semiautomaton, we perform a clustering experiment to demystify the FFN output representations on the tasks of PARITY and Cycle Navigation. The two tasks are chosen because we can easily derive their state transition functions. For example, there are only two state transitions in PARITY (the identity map triggered by a 0 and the flip between even and odd triggered by a 1), and five state transitions in Cycle Navigation (the five rotations of the cycle, i.e., moving by k ∈ {0, 1, 2, 3, 4} positions modulo 5). Given a testing input sequence of length 500, which is much longer than the training length 40, we extract the output o^(l)_i of all layers l, perform dimensionality reduction using PCA, and plot the dimension-reduced points on a 2D plane. Ideally, we want to see a limited number of clusters across all layers, indicating that the model learns to capture the state transition function. As we can see in Figure 3, PARITY has 2 clusters and Cycle Navigation has 5 clusters. The clear clustering effect demonstrates RegularGPT's correct learning of state transition functions. This is in contrast to the naive-summation approach learned by a vanilla Transformer as shown in Figure B.4 of Deletang et al. (2023).
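A minimal sketch of this probe is given below; the use of scikit-learn's PCA and the variable names are our choices, not the original analysis code:

import numpy as np
from sklearn.decomposition import PCA

def pca_probe(ffn_outputs):
    # ffn_outputs: array of shape (num_vectors, hidden_dim) holding o_i^(l)
    # collected across all layers l and positions i of a length-500 input.
    return PCA(n_components=2).fit_transform(ffn_outputs)

# Plotting pca_probe(...) should reveal a small number of clusters if the model
# tracks state transitions: 2 clusters for PARITY and 5 for Cycle Navigation.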

Receptive Field Analysis
We resort to the gradient analysis tool of Chi et al. (2023) to inspect the receptive field of RegularGPT on regular and natural languages. It computes a cumulative sum of the gradient norms starting from the most recent token back to the earliest one. A large slope at a position means the most recent token has a high dependency on that position. Ideally, we would like to see the receptive field covering the whole input sequence in the case of regular languages because every single bit in the input sequence matters for the final result. This corresponds to a slanted line going from the lower right to the upper left, which is validated in Figure 4a. As for natural language, we discover something interesting in Figure 4b: RegularGPT settles on the local windowed-attention pattern that prior work enforced manually (Press et al., 2022; Chi et al., 2022, 2023). This suggests that the task of natural language modeling mostly needs only local context to achieve good performance, which aligns with the common belief.
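A sketch of this cumulative gradient-norm probe is shown below, assuming a PyTorch-style model that exposes its input embeddings; model.embed and model.forward_from_embeddings are hypothetical hooks, not the actual API:

import torch

def receptive_field_curve(model, input_ids):
    # Gradient of the last position's logits w.r.t. every input embedding.
    emb = model.embed(input_ids)                   # (T, d); hypothetical embedding hook
    emb.retain_grad()
    logits = model.forward_from_embeddings(emb)    # hypothetical forward hook
    logits[-1].sum().backward()                    # probe the most recent position
    norms = emb.grad.norm(dim=-1)                  # per-position gradient norm
    # Cumulative sum from the most recent token back to the earliest one; a steep
    # slope at a position means the last token depends strongly on that position.
    return torch.cumsum(torch.flip(norms, dims=[0]), dim=0)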

Conclusion
This paper introduces RegularGPT, a novel variant of the Transformer architecture inspired by the notion of working memory that can effectively and efficiently model regular languages. Theoretical explanations and accompanying clustering visualizations are presented to illustrate how RegularGPT captures the essence of regular languages. Moreover, RegularGPT is evaluated on the task of natural language length extrapolation, revealing its intriguing rediscovery of the local windowed attention effect previously observed in related research. Notably, RegularGPT establishes profound connections with various existing architectures, thereby laying the groundwork for future Transformer models that facilitate efficient algorithmic reasoning and length extrapolation.

Limitations
Currently we set the chunk size C of RegularGPT to a constant. Can we make the chunk size more flexible? A flexible and data-driven C might further boost its performance on natural languages, which often exhibit diverse patterns, unlike regular languages underpinned by simple grammars. This might also improve the performance of RegularGPT when C = 128.

A Hyperparameters for the Regular Language Experiments
We report the hyperparameters used in the regular language experiments (Table 1) in Table 4.

B Hyperparameters for the Natural Language Experiments
We report the hyperparameters used in the natural language experiments (Table 3) in Table 5.

C Proof of Lemma 1
Lemma 1 (Approximation for Binary Matrix Product). Let A, B ∈ {0, 1}^{n×n} be binary matrices of dimension n × n. Then, there exists a two-layer ReLU network f_mlp such that f_mlp([Flat(A), Flat(B)]) = Flat(AB), where Flat(X)_{(i−1)n+j} = X_{i,j} for i, j ∈ [1, . . . , n] is the operation that flattens a matrix into a vector.
The binary matrix product AB is composed of n^3 binary scalar products of the form A_{ik} B_{kj} = x_{(i−1)n+k} · x_{(n+k−1)n+j} for i, j, k ∈ [1, . . . , n], where x = [Flat(A), Flat(B)] is the concatenated flattened input. Our goal is to construct two neural network layers. The first layer computes all n^3 binary scalar products; since the entries are binary, each product can be computed with a ReLU as A_{ik} B_{kj} = ReLU(A_{ik} + B_{kj} − 1). The second layer sums these products into the form of the matrix product, i.e., (AB)_{ij} = Σ_{k=1}^{n} A_{ik} B_{kj}.

Figure 1 :
Figure 1: The divide-and-conquer approach that solves the PARITY problem. The lightly shaded blue cells represent M^(l)_{mn} in Eq. (1). The darkened blue cells represent the routing path to solve the result for the last bit specifically. As we can see, this approach requires at most ⌈log_2 T⌉ layers to obtain the result for a length-T input sequence, rendering it a more efficient approach compared to the combination of scratchpad and recency biases.

Figure 3 :
Figure 3: Clustering of FFN output vectors across all layers via PCA on the tasks of PARITY and Cycle Navigation.

f_mlp([Flat(A), Flat(B)]) = ReLU([Flat(A), Flat(B)] W^(1) − 1_{n^3}) W^(2) = Flat(AB).

D Illustration of Lemma 1

D.1 Illustration of the Binary Weight Matrices

We illustrate W^(1) and W^(2) of Lemma 1 as follows:

import numpy as np

def get_W1(n):
    n2 = n * n
    W1 = np.zeros((2 * n * n, n ** 3), dtype=int)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # Column i*n2 + j*n + k corresponds to the product A_ik * B_kj.
                W1[i * n + k, i * n2 + j * n + k] = 1        # picks up A_ik
                W1[n2 + k * n + j, i * n2 + j * n + k] = 1   # picks up B_kj
    return W1

def get_W2(n):
    # Sums over k: each output entry (i, j) adds up its n binary products.
    eye = np.eye(n * n, dtype=int)
    ones = np.ones((n, 1), dtype=int)
    W2 = np.kron(eye, ones)
    return W2

get_W1(2) gives:

[[1 0 1 0 0 0 0 0]
 [0 1 0 1 0 0 0 0]
 [0 0 0 0 1 0 1 0]
 [0 0 0 0 0 1 0 1]
 [1 0 0 0 1 0 0 0]
 [0 0 1 0 0 0 1 0]
 [0 1 0 0 0 1 0 0]
 [0 0 0 1 0 0 0 1]]

Table 1 :
Table 1: Length generalization results on regular languages (Max/Avg). All models in the first section (Deletang et al.) are trained on sequences of length 40; the reported numbers are the average of length extrapolation results from 41 to 500. All models in the second section (Bhattamishra et al.) are trained on sequences of length 50; the reported numbers are the average of length extrapolation results from 51 to 100. Each result is an average over 3 seeds. Please refer to Appendix A for the detailed hyperparameters.

Table 2 :
Table 2: We alter the probability P(σ_i = 1) used to sample 1s in PARITY. The same-length setting is 40; the extrapolation setting is from 41 to 500. Each entry is an average over 3 seeds.