Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions

Despite the widespread success of Transformers on NLP tasks, recent works have found that they struggle to model several formal languages when compared to recurrent models. This raises the question of why Transformers perform well in practice and whether they have any properties that enable them to generalize better than recurrent models. In this work, we conduct an extensive empirical study on Boolean functions to demonstrate the following: (i) Random Transformers are relatively more biased towards functions of low sensitivity. (ii) When trained on Boolean functions, both Transformers and LSTMs prioritize learning functions of low sensitivity, with Transformers ultimately converging to functions of lower sensitivity. (iii) On sparse Boolean functions which have low sensitivity, we find that Transformers generalize near perfectly even in the presence of noisy labels whereas LSTMs overfit and achieve poor generalization accuracy. Overall, our results provide strong quantifiable evidence of differences in the inductive biases of Transformers and recurrent models, which may help explain Transformers' effective generalization performance despite relatively limited expressiveness.


Introduction
Transformers (Vaswani et al., 2017) have supplanted recurrent models across a range of NLP tasks (Brown et al., 2020, Liu et al., 2019) as well as other areas of machine learning (Bahri et al., 2020, Fakoor et al., 2020). In particular, effective large-scale pretrained models have predominantly been Transformer-based models and have found application in other areas such as computer vision and protein folding. Given the irrefutable importance of understanding these architectures, a significant effort has been devoted to analyzing the inner workings of large-scale pretrained Transformers. However, the cause behind the difference in performance between Transformers and recurrent models has largely been unclear. A line of work has attempted to understand neural sequence models through the lens of formal language theory. These works have sought to formally understand the expressive power of these architectures and identify differences in their practical abilities to generalize across various kinds of formal languages. A notable result by Hahn (2020) showed that Transformers are limited in their ability to express the Parity language while it is well known that small-sized recurrent models can express such languages. Across empirical studies, Transformers have been found to perform worse than or comparably to LSTMs in almost all formal languages previously considered in the literature. In particular, Transformers have been shown to struggle with the Parity language and certain other regular languages. This leads to a natural question: Why do Transformers perform so well in practice if they are arguably less expressive and perform worse than recurrent models across certain formal languages?
Although recurrent models such as LSTMs have been shown to perform better on formal languages such as Parity, we find that they struggle to generalize well on several sparse Boolean functions such as Sparse Parities. We find a clear contrast between the generalization abilities of Transformers and LSTMs on various k-sparse Boolean functions which have low sensitivity. Additionally, through extensive empirical analysis, we provide strong evidence to suggest differences in the bias towards low complexity functions between Transformers and recurrent models. Based on our results, we hypothesize that one of the reasons behind Transformers' practical effectiveness could be that they are more biased towards simple functions in comparison to recurrent models, which may lead to better generalization.
In particular, we focus on a complexity measure called sensitivity (Kahn et al., 1989), which measures how likely it is that a function value changes due to a 'small' change in input. Sensitivity is related to several other complexity measures; functions with low sensitivity have low Kolmogorov complexity, simpler Fourier spectra, and can be represented by decision trees of small depths. The relationship between sensitivity and generalization has also been previously studied in the literature (Franco, 2006, Novak et al., 2018). While measures such as Kolmogorov complexity are uncomputable, sensitivity can be tractably estimated and extensions of sensitivity can be used to estimate the complexity of functions in more realistic NLP tasks (Hahn et al., 2021).
Our Contributions. We investigate the bias in (a) parameter space by analyzing randomly initialized models, (b) learning procedure by examining the sensitivity of models during the training process, and (c) trained models by evaluating the performance of models on functions of low sensitivity. Our contributions can be summarized as follows: (i) We demonstrate that random Transformers are significantly more likely to represent functions of lower sensitivity than recurrent models when the weights are sampled uniformly or according to a Normal distribution (see Figure 1, bottom right). When the weights are initialized following practical strategies (such as Xavier normal), then both Transformers and LSTMs are likely to have low sensitivity with Transformers having relatively lower sensitivity.
(ii) We show that both Transformers and LSTMs learn functions of increasing sensitivity when trained on a set of Boolean functions as well as practical datasets such as sentiment classification (see Figure 1, top right). For Boolean functions, Transformers converge to functions of lower sensitivity in comparison to LSTMs when both models achieve near-zero training error.
(iii) On various k-sparse Boolean functions, we find that Transformers learn to generalize near-perfectly even in the presence of significant noise in the training data whereas LSTMs severely overfit and obtain poor generalization performance (see Figure 1, left).

Related Work
Simplicity bias and deep learning. Given the incredible capacity of neural networks to express arbitrarily complex functions (Cybenko, 1989, Hornik et al., 1989), their practical ability to generalize well instead of overfitting has remained a mystery. One approach to explaining deep learning's unexpected generalization performance has been to study the inductive biases of random neural networks. Several prior works have shown theoretically (Palma et al., 2019) and empirically (Valle-Perez et al., 2019) that random untrained feedforward networks are biased towards 'simple' functions. An effective tool to bound the generalization error of deep learning algorithms has been PAC-Bayesian bounds which depend on the KL-divergence between the distribution over functions generated by a learning algorithm and a prior probability distribution. Valle-Perez et al. (2019) showed that considering the distribution over functions generated via random neural networks leads to better bounds than traditional ones. Several works (Cohen et al., 2019, Lee et al., 2017, Mingard et al., 2019, Wilson and Izmailov, 2020) have argued using heuristic methods that the inductive biases in random neural networks can be used to understand the properties of trained networks. Additionally, there is empirical and theoretical evidence (Oymak and Soltanolkotabi, 2019) that neural networks trained with SGD usually converge close to the initialization point. Hence, understanding the properties of random neural networks is imperative to understand their generalization abilities. In Section 4.1, we study the complexities of random Transformers and recurrent models and investigate the differences between them.
Formal Languages and Sequence Models. In the past few years, a strand of work has attempted to understand neural sequence models' capabilities and inner workings by analyzing them on formal languages, e.g. (Sennhauser and Berwick, 2018, Suzgun et al., 2019b).
Given the recent success of Transformers, several works have sought to investigate them via the lens of formal languages. Hahn (2020) theoretically showed the limitations of Transformers in recognizing languages like Parity and Dyck-2. Following that, various recent works (Hao et al., 2022, Merrill et al., 2022) have associated the expressive power of Transformers with different classes of circuits depending on the restriction on resources such as precision, type of attention, etc. Barak et al. (2022) sought to theoretically explain how ReLU networks trained with SGD learn sparse parities and they also empirically observed that Transformers are able to learn sparse parities. Some of our results corroborate their findings and we empirically explore the phenomenon further.
While Transformers are expressive enough to represent the Parity language for bounded lengths, multiple works have observed that they struggle to generalize well when tested empirically. Furthermore, Transformers have been shown to achieve poor generalization performance in comparison to LSTMs on a certain subclass of regular languages in formal language recognition and instruction learning (Finlayson et al., 2022) tasks. In contrast to this, we show that when evaluated on some simpler variants of these formal languages, Transformers generalize near perfectly whereas LSTMs achieve poor generalization performance.

Sensitivity of Boolean Functions
We will work with a complexity measure called Boolean Sensitivity which has been widely studied in computational complexity (Ambainis et al., 2014, Kahn et al., 1989, O'Donnell, 2021). We first define the relevant notions of complexity and then discuss their relation to other complexity measures. For Boolean functions defined over the Hamming cube, sensitivity captures how many neighbours of a particular input have different outputs. Formally, the sensitivity of a Boolean function $f : \{0,1\}^n \to \{\pm 1\}$ at input $x \in \{0,1\}^n$ is defined as

$$s(f, x) = \sum_{i=1}^{n} \mathbb{I}\left[f(x) \neq f(x^{\oplus i})\right], \tag{1}$$

where $\mathbb{I}$ denotes the indicator function and $x^{\oplus i} = (x_1, \ldots, x_{i-1}, 1 - x_i, x_{i+1}, \ldots, x_n)$ is the same as $x$ at every coordinate or bit except the $i$-th one. The maximum sensitivity of a function $f$ is defined as $ms(f) = \max_{x \in \{0,1\}^n} s(f, x)$. The average sensitivity (also referred to as total influence) of a Boolean function measures the average of the sensitivity of the function across all inputs $x \in \{0,1\}^n$ and is defined as

$$s(f) = \mathbb{E}_{x}\left[s(f, x)\right] = \frac{1}{2^n} \sum_{x \in \{0,1\}^n} s(f, x). \tag{2}$$

Note that $0 \leq s(f) \leq ms(f) \leq n$. To compare across inputs of different lengths, in our experiments we normalize the average sensitivity by the length, $S(f) = \frac{1}{n} s(f)$, which can also be interpreted as

$$S(f) = \Pr_{x \sim \{0,1\}^n,\; i \sim [n]}\left[f(x) \neq f(x^{\oplus i})\right], \tag{3}$$

where $[n] = \{1, \ldots, n\}$ and the sampling is over the uniform distribution over the domains.
Parity. The Parity function over $\{0,1\}^n$ is defined as $\mathrm{Parity}(x) := (-1)^{\sum_{i=1}^{n} x_i}$. For any input $x \in \{0,1\}^n$, the function Parity has value +1 if the number of ones in the input is even and has value −1 otherwise. The sensitivity of the Parity function is the maximum among all functions over $\{0,1\}^n$ since changing any bit of any input changes the function value. Hence, for Parity over $\{0,1\}^n$, $s(\mathrm{Parity}) = n$ and $S(\mathrm{Parity}) = 1$.
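As an illustrative aside, the quantities defined in this section can be checked by brute force for small n. The sketch below is not the paper's code: the function names (`sensitivity_at`, `average_sensitivity`, and so on) are hypothetical, inputs are represented as tuples of bits, and the enumeration over all 2^n inputs is only feasible for short strings.

```python
from itertools import product

def flip(x, i):
    """Return a copy of the bit tuple x with the i-th bit flipped."""
    return x[:i] + (1 - x[i],) + x[i + 1:]

def sensitivity_at(f, x):
    """s(f, x): number of single-bit flips of x that change the value of f."""
    return sum(f(x) != f(flip(x, i)) for i in range(len(x)))

def max_sensitivity(f, n):
    """ms(f): largest s(f, x) over all inputs of length n."""
    return max(sensitivity_at(f, x) for x in product((0, 1), repeat=n))

def average_sensitivity(f, n):
    """s(f): mean of s(f, x) over all 2^n inputs."""
    total = sum(sensitivity_at(f, x) for x in product((0, 1), repeat=n))
    return total / 2 ** n

def normalized_sensitivity(f, n):
    """S(f) = s(f) / n, the length-normalized average sensitivity."""
    return average_sensitivity(f, n) / n

def parity(x):
    """Parity(x) = (-1)^(number of ones): +1 for an even count, -1 otherwise."""
    return 1 if sum(x) % 2 == 0 else -1

def constant(x):
    """A constant function, included as the opposite extreme (sensitivity 0)."""
    return 1
```

For n = 4, `average_sensitivity(parity, 4)` evaluates to 4.0 and `normalized_sensitivity(parity, 4)` to 1.0, matching s(Parity) = n and S(Parity) = 1, while the constant function has sensitivity 0.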
Sparse Boolean functions. Another class of functions are the k-sparse functions (also referred to as k-juntas) where the function value depends on at most k coordinates of the input. More formally, a function $f : \{0,1\}^n \to \{\pm 1\}$ is k-sparse if there exist a function $g : \{0,1\}^k \to \{\pm 1\}$ and indices $1 \leq i_1 < i_2 < \ldots < i_k \leq n$ such that $f(x) = g(x_{i_1}, \ldots, x_{i_k})$ for every $x \in \{0,1\}^n$. Let SPARSE-(k, n) be the class of k-sparse functions on inputs of length n that depend on at most k bits. It is easy to see that, for any $f \in$ SPARSE-(k, n), the average sensitivity $s(f) \leq k$ (and hence $S(f) \leq \frac{k}{n}$). When $k \ll n$, SPARSE-(k, n) can be seen as a subclass of all Boolean functions with low average sensitivity. Other functions with low average sensitivity can also be approximated with k-sparse functions using Friedgut's Junta Theorem (O'Donnell (2021), Page 269). The maximum average sensitivity $s(f) = k$ is attained by Sparse Parities, denoted $f_{\mathrm{parity}_k}$, which is the Parity over a subset of k coordinates. A sparse parity function $f_{\mathrm{parity}_k}$ over $S \subseteq [n]$, s.t. $|S| = k$, is +1 if the number of ones in the coordinates S is odd and −1 otherwise. Other Boolean functions such as sparse majority can be defined similarly. The majority function $f_{\mathrm{maj}}$ over $\{0,1\}^n$ is +1 if the number of ones in the input is greater than the number of zeros and is −1 otherwise. Similarly, the sparse majority function $f_{\mathrm{maj}_k}$ is the majority function over coordinates $S \subseteq [n]$, s.t. $|S| = k$. Parities (and Sparse Parities) are an important class of Boolean functions since any Boolean function can be represented as a linear combination of a set of Parity functions.
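The bound s(f) ≤ k for k-sparse functions can be illustrated with a small brute-force check. This is a sketch under stated assumptions rather than the paper's implementation: the helper names (`make_sparse`, `parity_g`, `majority_g`, `random_g`) and the choice n = 6, k = 3 with relevant indices (0, 2, 5) are hypothetical, and the sign convention of the parity does not affect its sensitivity.

```python
import random
from itertools import product

def average_sensitivity(f, n):
    """Exact average sensitivity s(f), enumerating all 2^n inputs."""
    total = 0
    for x in product((0, 1), repeat=n):
        fx = f(x)
        for i in range(n):
            neighbour = x[:i] + (1 - x[i],) + x[i + 1:]
            total += fx != f(neighbour)
    return total / 2 ** n

def make_sparse(g, indices):
    """Lift g on k bits to a k-sparse function that reads only `indices`."""
    return lambda x: g(tuple(x[i] for i in indices))

def parity_g(z):
    # (-1)^(number of ones); the opposite sign convention has the same sensitivity.
    return -1 if sum(z) % 2 else 1

def majority_g(z):
    # +1 if there are more ones than zeros among the relevant bits.
    return 1 if 2 * sum(z) > len(z) else -1

def random_g(k, rng):
    # A random truth table over the 2^k settings of the relevant bits.
    table = {z: rng.choice((-1, 1)) for z in product((0, 1), repeat=k)}
    return lambda z: table[z]

n, k = 6, 3
indices = (0, 2, 5)  # hypothetical choice of relevant coordinates
sparse_parity = make_sparse(parity_g, indices)
sparse_majority = make_sparse(majority_g, indices)
random_junta = make_sparse(random_g(k, random.Random(0)), indices)
```

Here the sparse parity attains the maximum, s(f) = k = 3, the sparse majority over three bits has s(f) = 1.5, and any random junta stays at or below k, since flipping an irrelevant coordinate never changes the output.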

Why Sensitivity?
Sensitivity can be seen as a discrete analog (Gopalan et al., 2016) of the 'smoothness' of a continuous function which measures how gradually a function changes locally. Functions of higher sensitivity can be considered more complex since the function value can be changed by changing any of a large subset of bits whereas functions of lower sensitivity depend on fewer bits and their function value can be determined based on a small number of input coordinates. Sensitivity measures are also polynomially related to several other notions of complexity such as the depth of a decision tree, certificate complexity, and the degree of the Fourier expansion of Boolean functions (see Ambainis et al. (2014) for more details). The correlation between generalization and a different notion of sensitivity has been demonstrated in Novak et al. (2018) for computer vision models. The relation between generalization and a variant of Boolean sensitivity was explored over a decade ago by Franco (2006). More recently, Hahn et al. (2021) extended the notion of block sensitivity to incorporate variable length sequences and proposed it as a measure to estimate the difficulty of various NLP tasks.
Auxiliary Results. Although not the primary focus of the paper, we explore some relations between sensitivity and generalization in Appendix D. In particular, we show how sensitivity can be seen as a capacity measure to derive generalization bounds. Additionally, we explore the correlation between sensitivity and generalization gap for LSTMs and Transformer-based models on sentiment classification tasks. We also study the correlation between average sensitivity S and three other complexity measures previously studied in the literature, namely Entropy (Mingard et al., 2019), critical sample ratio (Arpit et al., 2017), and the size of the smallest Boolean expression representing the function.

Sensitivity Experiments
In this section, we conduct various experiments to investigate the differences in the bias of Transformers and RNNs towards functions of low sensitivity. From here onward, whenever sensitivity is mentioned, we will refer to the length normalized version of average sensitivity S defined in Eq. (3). The first part of this section deals with analyzing the sensitivity of random Transformers and RNNs while the second part investigates the sensitivity of models trained to fit random Boolean functions.

Sensitivity of Randomly Initialized Models
We seek to understand the landscape of the complexity of functions in the parameter space of Transformers and RNNs. Let us assume that the parameter space Θ of our models is bounded, i.e. all the parameters (weights) take some value within some bounded range [−B, B]. A particular realization of the parameters with values in [−B, B] leads to the model being a function from $\{0,1\}^n \to \{0,1\}$. We begin with a simple question: Out of all the parameterizations in the parameter space of Transformers (or RNNs), if we select one uniformly at random, then how likely is it to have low sensitivity?
Setup. In all our experiments, we consider binary classifiers with Transformers and RNN-based architectures. By Transformer, we refer to the encoder-only version of the original Transformer architecture (Vaswani et al., 2017) as used in models such as BERT (Devlin et al., 2019). The model takes a sequence of tokens along with a [CLF] token as input. The final classification is done based on the output vector of the [CLF] token. For recurrent models, we consider LSTMs (Hochreiter and Schmidhuber, 1997), GRUs, and RNNs with tanh activation. Most of the results in the main paper pertaining to recurrent models are based on experiments with LSTMs and we discuss when the behaviour is different for other recurrent models.
In our experiments, we explore four strategies to sample random networks: Uniform, Gaussian, Xavier uniform, and Xavier normal initialization. In uniform sampling, each parameter (weights and biases) is assigned a value by uniformly sampling in [−10, 10]. Similarly, for Gaussian initialization, each parameter is assigned by sampling from $\mathcal{N}(0, \sigma^2)$ where we set $\sigma = 10$. Xavier normal (Glorot and Bengio, 2010) initialization is the one that is more commonly used in practice to train these models. All the weights are initialized with $\mathcal{N}(0, \sigma^2)$ where the standard deviation $\sigma = d^{-1/2}$ and d is the number of hidden units. All the input embedding vectors and positional embedding vectors are initialized with $\mathcal{N}(0, 1)$ which is the default scheme in PyTorch (Paszke et al., 2019). For input lengths greater than 10, we estimate the sensitivity of each model by computing the average over a sampled set of bit strings. We sample 10k bit strings and compute the average sensitivity across the samples. For each hyperparameter configuration, we sample 75-1000 different models to estimate their sensitivity depending on the computational costs associated with it. For most of the results reported in the main paper, we consider bit strings of length 20. But we also experiment with lengths ∈ {5, 7, 10, 15, 20, 50, 100, 200}.
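The sampling-based estimate described above can be sketched as a Monte Carlo average over uniformly drawn bit strings. This is a minimal illustration, not the authors' code: `estimate_sensitivity` is a hypothetical name, and `f` stands in for any black-box binary classifier (for instance, the thresholded output of a randomly initialized model) that maps a list of bits to a label.

```python
import random

def estimate_sensitivity(f, n, num_samples=10_000, seed=0):
    """Estimate S(f) by averaging s(f, x) / n over sampled bit strings x."""
    rng = random.Random(seed)
    changed = 0
    for _ in range(num_samples):
        x = [rng.randint(0, 1) for _ in range(n)]
        fx = f(x)
        for i in range(n):
            x[i] ^= 1              # flip bit i in place
            changed += f(x) != fx  # count flips that change the label
            x[i] ^= 1              # restore bit i
    return changed / (num_samples * n)

def parity(x):
    """Stand-in high-sensitivity classifier: label is the parity of the bits."""
    return sum(x) % 2

def dictator(x):
    """Stand-in low-sensitivity classifier: label is the first bit."""
    return x[0]
```

With n = 20, the estimate is exactly 1 for `parity` and exactly 1/20 for `dictator`, because every input of those two functions has the same sensitivity; for a general model the estimate fluctuates with the sample, which is why a large number of strings is drawn.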
Results. Figure 2 (upper row) shows the distribution of sensitivity for uniformly initialized Transformers and LSTMs. The distribution for Transformers is heavily skewed towards functions of very low sensitivity in comparison to LSTMs. The pattern holds across Gaussian initialization as well (see Figure 1, bottom right). For initialization strategies used in practice such as Xavier normal and Xavier uniform, we find that both Transformers and LSTMs have low sensitivity (see Figure 2, lower row and Figure 10) with Transformers having relatively lower average sensitivity. Refer to Section B.2 in the Appendix for results with Xavier uniform initialization. Although we primarily discuss results with sensitivity in the main paper, similar experiments with other complexity measures are presented in Appendix B.1.
Finding Parity. For strings of length 5, the total number of possible functions $f : \{0,1\}^5 \to \{0,1\}$ is $2^{2^5}$. If Boolean functions are sampled uniformly, then the probability of picking the Parity function is less than 1 in two billion. However, based on uniformly sampling 10 million LSTMs of depth 2 and hidden size 8, we found that the probability of finding one that represents Parity over length 5 is 1 in 30,000. Hence, it is over 60,000 times more likely to find the Parity function by sampling LSTMs in comparison to randomly sampling Boolean functions. This indicates that the parameter space of recurrent models such as LSTMs has a significant representation of Parity functions which might help explain why it is easier for them to learn Parity. On the other hand, for Transformers, we did not find a single sample which represented Parity based on 10 million samples.
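The arithmetic behind this comparison can be spelled out as follows. The variable names are illustrative; the only inputs are the figures quoted in the text (input length 5 and a rate of 1 in 30,000 for LSTMs).

```python
# Number of Boolean functions on 5 input bits: one output bit per each of 2^5 inputs.
num_functions = 2 ** (2 ** 5)

# Chance of drawing one specific function (e.g. Parity) uniformly at random.
p_uniform = 1 / num_functions

# Empirical rate reported in the text for randomly sampled LSTMs.
p_lstm = 1 / 30_000

# How many times more likely Parity is under LSTM sampling than uniform sampling.
ratio = p_lstm / p_uniform
```

The count `num_functions` is 4,294,967,296, so `p_uniform` is indeed below 1 in two billion, and `ratio` comes out at roughly 143,000, consistent with the conservative "over 60,000 times" stated above.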
Discussion. These results imply that lower sensitivity functions are over-represented in the parameter space of Transformers. If every Boolean function $f : \{0,1\}^n \to \{0,1\}$ had equal representation in the parameter space of the model, then the distribution would have concentrated around 1/2. A learning algorithm based on a random search over the parameter space is more likely to encounter functions of low sensitivity. Note that, while recurrent models have higher sensitivity than Transformers, their sensitivity is still lower than that of a randomly sampled Boolean function.
While it is not entirely clear why random Transformers are relatively more biased towards low complexity functions, we observe that they behave similarly to hard-attention Transformers upon inspection of attention weights. Recent works (Hahn, 2020, Hao et al., 2022) have shown that hard-attention Transformers can only represent functions in $AC^0$ (which contains functions that can be represented by constant depth AND/OR circuits). Since $AC^0$ circuits can only represent functions of low average sensitivity (O'Donnell, 2021), it might help explain why random Transformers have low sensitivity.
Change across hyperparameters. For uniform sampling, a general observation for both architectures is that the likelihood of higher sensitivity functions increases with the number of layers (see Figure 3, left and Figure 9). However, even for Transformers with depth 12, the distribution is heavily skewed towards low sensitivity functions in comparison to recurrent models with depth 1. Unlike recurrent models, the sensitivity of Transformers decreases when the width of the model is increased (see Figure 3, middle). For Transformers, the average sensitivity decreases with the increase in the length of the strings (see Figure 3, right), whereas for LSTMs, it remains quite high even for lengths up to 200 (see Figure 9 in the Appendix).
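The value 1/2 for a uniformly sampled Boolean function follows from a short standard calculation, included here as an aside rather than as part of the original analysis. For a uniformly random f, the values f(x) and f(x^{⊕i}) are independent fair coin flips, so each of the n·2^{n−1} edges of the Hamming cube is disagreed upon with probability 1/2, and these edge events are pairwise independent:

```latex
\begin{align*}
\Pr_f\big[f(x) \neq f(x^{\oplus i})\big] &= \tfrac{1}{2}
  \quad \text{for every fixed } x \text{ and } i, \\
\mathbb{E}_f\big[S(f)\big]
  &= \mathbb{E}_{x,\, i}\, \Pr_f\big[f(x) \neq f(x^{\oplus i})\big] = \tfrac{1}{2}, \\
\operatorname{Var}_f\big[S(f)\big]
  &= \frac{1}{4 \cdot n\, 2^{n-1}} = \frac{1}{n\, 2^{n+1}}.
\end{align*}
```

Since the variance shrinks exponentially in n, Chebyshev's inequality gives the concentration around 1/2.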

Models learn functions of increasing sensitivity
In this section, we investigate the sensitivity of functions learned during the training process when Transformers and LSTMs are trained to fit datasets of Boolean strings with random labels.
Setup. We create datasets of size 1k each by uniformly sampling bit strings of length 40. The label for each input string is assigned randomly (+1 or −1 with probability 1/2). All the weights of the models are initialized with Xavier normal initialization and the biases are initialized with zero vectors. We consider Transformers and LSTMs across various hyperparameter configurations with a similar number of parameters. We train the models until they reach zero training error and estimate the sensitivity of the models at every epoch. We conduct the experiments over 20 different datasets with 100 runs for Transformers and LSTMs each.
Sensitivity during training. We find that both Transformers and LSTMs gradually learn functions of increasing sensitivity with Transformers converging to functions of much lower sensitivity than LSTMs (refer to Figure 1, top right). Note that, even if LSTMs converge on functions with higher sensitivity, it is still lower than the underlying labelling function which has sensitivity concentrated around 0.5. We observe similar behavior when the models are trained on random sparse Boolean functions and sparse parities. Even though sensitivity is defined over Boolean functions, we explore a few natural extensions to estimate the sensitivity of models trained on real datasets such as sentiment classification. On two sentiment classification datasets, namely SST (Socher et al., 2013) and IMDB (Maas et al., 2011), we found similar observations where both Transformers and LSTMs seem to incrementally learn functions of increasing sensitivity. See Appendix C for more details.
Discussion. Even if sequence models such as Transformers or LSTMs are capable of representing arbitrary functions, our results suggest that they prioritize learning simpler patterns first. These results echo prior observations that indicate feedforward-like neural networks trained with SGD learn functions of increasing complexity (Arpit et al., 2017, Nakkiran et al., 2019). Rahaman et al. (2019) find that ReLU neural networks learn functions of lower frequency modes first. Functions with lower average sensitivity also have a lower frequency and hence these observations are closely connected. More importantly, average sensitivity can be naturally extended to real data which allows us to empirically explore this for text data.
Sensitivity upon convergence. For Transformers and LSTMs trained until 0% training error, we estimate the sensitivity of functions learned by the models. We create 15 datasets and for each dataset, we compute the sensitivity of 100 trained models. The combined distribution of the sensitivity of the models across all datasets is shown in Figure 4a. We observe that Transformers consistently learn functions of lower sensitivity in comparison to LSTMs. This supports our hypothesis that for Transformers the parameter search via algorithms such as Adam is more likely to find functions of lower sensitivity that fit the training set as opposed to LSTMs.

Experiments on Sparse Boolean Functions
Our results in the previous section indicate that relative to LSTMs, random Transformers are biased towards low-sensitivity functions and Transformers are biased towards learning Boolean functions of low sensitivity. Motivated by this difference in bias, we conduct experiments geared towards answering the following question: Is there any difference between the ability of Transformers and LSTMs to learn sparse Boolean functions which have low sensitivity?

Setup.
Boolean Functions. We focus on k-sparse Boolean functions which have low sensitivity when $k \ll n$ (refer to Section 3 for definition). We first consider certain Boolean functions which are widely studied in the analysis of Boolean functions. The first one is Sparse Parities which can be interpreted as the k-sparse variation of standard parity. We denote an instance of Sparse Parities as Parity-(n, k) where n denotes the length of the input string and k denotes the number of relevant bits. We denote an instance of standard Parity as Parity-n where n denotes the length of the input string and the output is computed based on the number of ones in all indices. Learning Parity-(n, k) with gradient-based methods has well-known hardness results, requiring at least $n^{\Omega(k)}$ computational steps to find the correct target function (Kearns, 1998). The other two Boolean functions we consider are sparse majorities (denoted by Maj-(n, k)) and the dictator function (denoted by Dict-n). The output of the dictator function depends only on a single input bit, making it arguably one of the simplest Boolean functions with very low sensitivity. In Maj-(n, k), the output for a string of length n is determined by whether the number of ones is greater than the number of zeros in the k relevant indices.
The second set of Boolean functions we consider is random k-sparse functions (denoted by Juntas-(n, k)). For each instance of Juntas-(n, k), the function is determined by randomly choosing k indices and assigning labels to each of the $2^k$ distinct inputs randomly.
Noisy Labels. We also conduct experiments to examine the ability of the models to learn in the presence of noise. In these experiments, labels of training data are flipped with a certain probability η. Thus about 1 − η fraction of the training data is clean and η fraction of the training data has incorrect labels. The validation set is clean without any modifications. The goal is to investigate whether a model is robust to noise during the training process.
Training Details. The training and validation sets are created by uniformly sampling bit strings over $\{0,1\}^n$. In our experiments, we consider Transformers with 1-6 layers, 4-8 heads, and width (usually referred to as d_model) within 8-128. We consider Transformers with both learnable and absolute positional encodings. For LSTMs, we consider up to 6 layers and widths (also referred to as hidden_size) within 8-256. The size of the token embeddings is kept the same as the width. We use batch sizes of 100 and 500 in all our experiments and tune across learning rates ∈ {1e-1, 5e-2, . . . ,1e-5, 1e-6}. For each dataset, we extensively tune the models across various hyperparameter settings, details of which are provided in Section G in the Appendix.
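The data-generation procedure described above (a random k-sparse target, uniformly sampled inputs, training labels flipped with probability η, and a clean validation set) might be sketched as follows. The function names and the specific sizes in the usage lines are assumptions made for illustration; they are not taken from the paper's code.

```python
import random

def make_junta(n, k, rng):
    """Random k-sparse target: pick k indices and a random +/-1 truth table."""
    indices = rng.sample(range(n), k)
    table = [rng.choice((-1, 1)) for _ in range(2 ** k)]
    def f(x):
        key = 0
        for i in indices:          # read only the k relevant bits
            key = 2 * key + x[i]
        return table[key]
    return f, indices

def sample_dataset(f, n, size, rng, eta=0.0):
    """Uniform bit strings labelled by f; each label is flipped w.p. eta."""
    data = []
    for _ in range(size):
        x = tuple(rng.randint(0, 1) for _ in range(n))
        y = f(x)
        if rng.random() < eta:
            y = -y                 # inject label noise
        data.append((x, y))
    return data

def accuracy(f, data):
    """Fraction of examples on which f agrees with the stored label."""
    return sum(f(x) == y for x, y in data) / len(data)

rng = random.Random(0)
target, relevant = make_junta(n=50, k=5, rng=rng)
train = sample_dataset(target, 50, 5000, rng, eta=0.1)  # noisy training set
valid = sample_dataset(target, 50, 1000, rng)           # clean validation set
```

Under this setup, a model that recovers the target exactly would score 100% on the clean validation set but only about 1 − η on the noisy training labels, so training accuracy well above 1 − η is a sign of fitting the noise.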

Experiments
Parities. For Parity-(40, 4) and Parity-40, we create 5 different datasets and report the results based on the maximum accuracy achieved on unseen test data. The training set consists of 30k samples and the validation sets contain 10k samples.
We observe a stark contrast between the performance of Transformers and LSTMs on different forms of parity tasks. We find that Transformers struggle to fit and generalize on Parity-40 while LSTMs easily (across a range of hyperparameters) generalize well on them. On the other hand, perhaps surprisingly, on Parity-(40, 4), we find that while Transformers generalize well, LSTMs severely overfit and achieve poor validation accuracy. Although LSTMs achieve 100% training accuracy over the training data, their validation accuracy does not move far beyond the chance level (50%). Figure 5 depicts the training and validation accuracy curves for Transformers and LSTMs on Parity-(40, 4) task. We find similar behaviour for LSTMs even with learnable positional embeddings. Additional results on Parity-(n, k) across different dataset sizes, variations of tasks, and parameters are provided in Appendix E.
Robustness to noise. On Sparse Parities datasets, we find that Transformers are surprisingly robust to noise. When the training data contains 5%-20% noise (η), Transformers achieve perfect generalization accuracy with training accuracy converging at 1 − η. In some cases, after training for a large number of iterations post-convergence, Transformers begin to overfit on the noise. This observation echoes a similar finding in Tänzer et al. (2022) where they observed such behaviour while finetuning large pretrained models for sequence tagging tasks in the presence of noise. The training and validation accuracy curves are provided in Figure 6. The behaviour of recurrent models is the same as in the previous scenario with clean data: they overfit on the training data while achieving chance level validation accuracy.
We observe this pattern across other sparse Boolean functions such as sparse majority and dictator functions as well. For sparse majority datasets Maj-(n, 5), we consider lengths n ∈ {50, 75, 100, 200} and for dictator functions Dict-n, we consider lengths n ∈ {100, 200, 300, 500, 700}. We experiment with various rates of noise (10-30%). While LSTMs do generalize well up to certain lengths, they achieve poor validation accuracy (<75%) as the lengths go higher. At the same time, they obtain 100% training accuracy on all the datasets. The validation accuracies of LSTMs are reported in Figure 7. Transformers on the other hand are able to achieve near-perfect generalization even in the presence of significant noise.
Random k-sparse functions. For Juntas-(n, k), we experiment with various datasets for Juntas-(n, 5) with n ∈ {30, 50, 80, 150, 200}. For lengths n < 150, we find that LSTMs generalize well on some of the Juntas-(n, 5) functions. However, in the presence of 10% noise (i.e., η = 0.1), their performance degrades sharply. We create 10 datasets for Juntas-(50, 5) with η = 0.1, and similar to previous scenarios, LSTMs struggle to reach validation accuracy above 75% whereas Transformers are able to generalize perfectly on all the datasets (see Figure 1, top middle). However, even when the validation accuracies of LSTMs were below 75%, their training accuracy reached 100% indicating that they overfit on the training data. Figure 1 (bottom left) shows the training and validation curves of LSTMs on the 10 datasets.
Sensitivity During Training. We observe that on k-sparse functions, both Transformers and LSTMs learn functions of increasing sensitivity. However, when LSTMs overfit and reach zero training error, they converge to functions of much higher sensitivity than Transformers. Since Transformers generalize near-perfectly, their sensitivity upon convergence matches that of the target function. Figure 4b depicts the sensitivity of both models when trained on Juntas-(50, 10) datasets with 10% noise. The behaviour of sensitivity during training is similar for Parity-(n, k) datasets. This indicates that while both models can find a function (through gradient-based methods) that perfectly fits the training data, LSTMs are biased towards converging on more complex functions that have higher sensitivity.
Phase Transitions. As reported in Barak et al. (2022), we observe phase transitions on Parity tasks where the training and validation accuracies do not change for a large number of training iterations and then abruptly reach near-perfect accuracy in a few iterations (see Figure 6). This phenomenon was observed for feedforward networks (FFNs) and Transformers in Barak et al. (2022) and theoretically explained for ReLU FFNs trained with SGD. We observe another such behaviour for LSTMs on Parity-n (see Figure 15 for training curves on Parity-30). For both LSTMs and Transformers, we were unable to get them to generalize well with SGD on either Sparse Parities or standard Parity. Both the architectures seem to succeed with the Adam optimizer (Kingma and Ba, 2014).
Grokking. Another interesting phenomenon we observe in Transformers is that in some cases the training accuracies start increasing gradually with no change in the validation accuracy. After some iterations, the validation accuracy increases and matches the training accuracy. We reliably observed this phenomenon while training Transformers with absolute positional encodings across training sets of various sizes (see Figure 16) and while training with learnable encodings on small-sized training sets (see Figure 17). Similar observations for grokking (Power et al., 2022) were made in Barak et al. (2022) for ReLU FFNs trained on small-sized training sets.

Clarifications
(1) Do our results imply that Transformers can learn any k-sparse function with a small (practical) number of examples? No. For small lengths (n = 50) and k = 3, we could enumerate and verify that they are able to learn all functions in the presence of 10% noise. However, as the length n and the number of relevant bits k grow, Transformers struggle to perform well. Given the computational hardness associated with learning Sparse Parities, the task becomes much more difficult with the increase in n and k. For n = 100 and k = 5, we were not able to obtain good generalization performance with Transformers.
(2) Do Transformers never overfit on k-sparse functions? They do overfit when the size of the training data is very small. For Sparse Parities with n = 40 and k = 4, it is perhaps surprising that Transformers learn the correct function even with as few as 2500 training examples in less than 10000 computational steps. However, for training sets of size 1000, Transformers overfit across all runs. Additionally, with training sets of size 5000-10000, Transformers with higher depths overfit in some cases. See Appendix E for more details.
(3) Does the low sensitivity bias of Transformers (Section 4) explain their good generalization performance on k-sparse functions such as Sparse Parities? No. Our findings in Section 4 motivated us to compare the performance of Transformers and LSTMs on functions of low sensitivity such as k-sparse functions. While the bias towards low sensitivity functions and strong performance on various k-sparse functions could be related, it is not a direct explanation for their performance on k-sparse functions. For Transformers on Sparse Parities, it is natural to expect them to follow some mechanism along the lines presented in Barak et al. (2022) for FFNs trained with SGD. However, the exact details are not clear, and more importantly, why and how LSTMs overfit is unclear as well.
(4) Are Transformers performing better than LSTMs because of learnable positional embeddings? This seems unlikely since we found that Transformers with absolute positional encoding also generalize well on sparse parities (see Figure 16). Additionally, we found that LSTMs with learnable positional embeddings also fail to generalize on sparse parities and behave similarly to Figure 5.
(5) Do LSTMs never succeed in learning Sparse Parities from data? They do succeed for smaller lengths. For lengths up to 20, we find that both Transformers and LSTMs are able to learn Parity and Sparse Parities. However, for higher lengths, Transformers struggle to fit Parity and LSTMs begin to overfit on Sparse Parities. For length n = 20 and k = 4, we robustly found that even LSTMs without positional embeddings succeeded in learning sparse parities. On the other hand, for n = 40 and k = 4, we robustly found that LSTMs with learnable positional embeddings overfit and achieve poor generalization performance. Transformers were able to generalize well in the presence of noise across various hyperparameters for Sparse Parities with n = 40 and k = 4. Our goal is not to identify the exact class of functions that Transformers can learn in practice. The key result is the contrast in performance between Transformers and LSTMs across various k-sparse functions.

Discussion and Final Remarks
We discuss some natural questions regarding the broader implications of our results. (a) Are Transformers performing better since the tasks are more suited to their architecture? Perhaps yes. One could argue that a number of regular languages that Transformers struggle to learn (Bhattamishra et al., 2020a, Chiang and Cholak, 2022) are better suited to recurrent architectures. Transformers have been shown to perform poorly on languages that require modular counting. Finite state automata, which are often considered to be formal abstractions of recurrent models, can represent these more efficiently. For instance, languages like standard parity can be represented with a two-state DFA while representing sparse parities would require a larger number of states. On the other hand, for circuits that have recently been related to Transformers (Hao et al., 2022), representing sparse parities would be easier than representing standard parity. Our results indicate that previous works might have overestimated the performance of LSTMs by considering regular languages which are more suited for autoregressive architectures. (b) Do Transformers work effectively in practice primarily due to their simplicity bias? It is hard to answer this question. In our work, we try to highlight differences between Transformers and LSTMs with respect to certain properties which have close connections to generalization. While these properties could partially be the reason behind their good generalization performance, it is also possible that they are ubiquitous in practice because they effectively model long-distance dependencies and can be trained efficiently.
The question of which formal languages are more closely associated with practical tasks is not entirely clear. Prior works on analysis with formal languages have primarily followed the Chomsky hierarchy owing to the conjecture that natural languages are mildly context-sensitive. While regular languages such as Parity have high sensitivity (S = 1), practical tasks are often structured and typically have much lower sensitivity (Hahn et al., 2021). In tasks such as sentiment analysis, the label often depends on a sparse subset of input tokens. When practical text datasets such as SST (Socher et al., 2013) are labelled with random noise, it can be formally shown that their sensitivity would be concentrated around 1/2. As shown in Figure 21, Transformers (and even LSTMs) take much longer to fit such datasets whereas, in the case of the true sentiment labels, these models achieve near-perfect training accuracy in a couple of epochs.
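The value 1/2 for randomly labelled data can be illustrated in the simpler Boolean setting. The short derivation below is a sketch under stated assumptions: f is drawn uniformly at random from all functions on n bits, S(f) is the average sensitivity normalized by n (so that Parity has S = 1), and $x^{\oplus i}$ denotes x with its i-th bit flipped. It covers only the expectation for Boolean inputs, not the concentration claim or the variable-length text setting discussed above.

```latex
% Assumption: f is uniform over all functions {0,1}^n -> {0,1}, so the labels
% f(x) and f(x^{\oplus i}) are independent uniform bits and differ w.p. 1/2.
\begin{align*}
S(f) &= \frac{1}{n\,2^n} \sum_{x \in \{0,1\}^n} \sum_{i=1}^{n}
        \mathbb{1}\!\left[ f(x) \neq f(x^{\oplus i}) \right], \\
\mathbb{E}_f\!\left[ S(f) \right]
     &= \frac{1}{n\,2^n} \sum_{x \in \{0,1\}^n} \sum_{i=1}^{n}
        \Pr_f\!\left[ f(x) \neq f(x^{\oplus i}) \right]
      = \frac{1}{n\,2^n} \cdot n\,2^n \cdot \frac{1}{2}
      = \frac{1}{2}.
\end{align*}
```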
Our results indicate that while Transformers perform poorly on certain regular languages, they generalize more effectively than recurrent models on various sparse Boolean functions. Moreover, we demonstrated that random Transformers as well as those trained with gradient-based algorithms are biased towards functions of low sensitivity. Our results add to the body of evidence that suggests that there is a form of implicit regularization in the procedure used to train neural models which prevents them from overfitting despite their incredible capacity.

Limitations
A general limitation of this line of work is that most of the results are primarily confined to artificial datasets. Although such formal languages provide us with a controlled setting and clarity regarding the precise nature of the problem, the relation to practical tasks remains unclear. Hence, while our results highlight the contrast in the performance between the two types of architectures, their precise implications for real-world tasks remain unclear.
There are two negative results that do not support our hypothesis. (a) All the experiments discussed in the main paper are on strings of fixed lengths. We conducted some experiments on tasks with variable length sequences which in some sense have low sensitivity. The tasks can be seen as a variable length extension of sparse parities and sparse majorities. Unlike the fixed length setting, we found both LSTMs and Transformers perform similarly on those tasks. See Section E.1 in the Appendix for more details. (b) Although we found Transformers to consistently converge to low sensitivity functions in the case of Boolean functions, we did not find similar behaviour on sentiment classification datasets such as SST and IMDB.
A caveat with empirical studies such as this is that the results depend on the hyperparameters and other aspects of the experimental setup. While we have tried to be as thorough as possible with hyperparameter tuning, there is always a chance that the results or behaviour could differ for some hyperparameter.

A Roadmap
The appendix is organized as follows.
• In Section B, we report and discuss additional results on the complexity of random models.
• In Section C, we investigate the sensitivity of models on real data. In particular, we demonstrate that models learn functions of increasing sensitivity on sentiment classification datasets such as SST and IMDB.
• In Section D, we discuss some additional results relating sensitivity and generalization.
• In Section E, we present additional experiments investigating the ability of Transformers and LSTMs to learn Parity and Sparse Parities.
• In Section F, we present some experiments to show that both Transformers and LSTMs can easily fit practical datasets even when they are labelled randomly.
• In Section G, details of implementation and experimental setup are discussed which are relevant for the reproducibility of the results.
• In Section H, we discuss some additional works related to our paper.

B Complexity of Random Models
In this section, we discuss additional results related to the complexity of random Transformers and LSTMs. We present results with additional complexity measures, initialization strategies, and variations across hyperparameters.

B.1 Additional Measures
As discussed in Section 3, sensitivity is related to several other complexity measures. Since it is more tractable to estimate sensitivity as opposed to certain other measures, we primarily focused on estimating and comparing sensitivity in the main paper. We explore three other complexity measures which have been previously explored in the literature to compute the complexity of functions represented by neural models. The measures are defined as follows: 1. SOP (Size of Boolean Expression): This measure computes the size of the smallest Boolean expression in Sum-of-Product form that represents the function. In order to compute this for a neural network over $\{0, 1\}^n$, we compute the output of the model over all $2^n$ inputs and then use standard libraries (SymPy (Meurer et al., 2017)) to find the Boolean expression. The size indicates the number of operators and operands in the smallest expression. Since the problem of minimizing Boolean expressions is NP-complete, the runtime grows exponentially, and hence, we can only compute this up to length 10 for several samples of random models. This measure was explored in Valle-Perez et al. (2019).

2. Entropy: This measure takes the output labels for all $2^n$ inputs and simply computes the entropy over the labels. This is a weak measure and primarily indicates how imbalanced the label set is. This [...] of models. We take Transformers and LSTMs with depth ∈ {1, 2, 4, 8} and width (d_model/hidden_size) ∈ {8, 32, 64, 256, 768}. We take an equal number of samples for each hyperparameter. Figure 8 shows the distribution of SOP based on 50k samples for a fixed hyperparameter configuration of Transformer and LSTM. It includes a 1-layer LSTM with width 64 and a 4-layer Transformer with width 64.
As can be seen in Figure 11, there exists significant correlation between sensitivity and other measures. Note that high-sensitivity functions will always have high entropy and high CSR, but the converse is not true. Functions with maximum entropy can also have low sensitivity. For instance, the dictator function has maximum entropy (since half the inputs have label 1 and the other half have label 0) while having the minimum sensitivity. Similarly, CSR can be seen as a weaker version of sensitivity.
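The dictator example can be checked by brute force for a small input length. The sketch below is illustrative only: the helper names and the choice n = 8 are assumptions, and sensitivity is normalized by n so that Parity has value 1.

```python
import itertools
import math

def label_entropy(f, n):
    """Entropy (in bits) of the labels of f over all 2^n inputs."""
    ones = sum(f(x) for x in itertools.product((0, 1), repeat=n))
    p = ones / 2 ** n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def exact_sensitivity(f, n):
    """Normalized average sensitivity of f, computed over all 2^n inputs."""
    flips = 0
    for x in itertools.product((0, 1), repeat=n):
        fx = f(x)
        for i in range(n):
            y = x[:i] + (1 - x[i],) + x[i + 1:]  # x with bit i flipped
            flips += f(y) != fx
    return flips / (n * 2 ** n)

n = 8
dictator = lambda x: x[0]
parity = lambda x: sum(x) % 2
print(label_entropy(dictator, n), exact_sensitivity(dictator, n))  # 1.0 0.125
print(label_entropy(parity, n), exact_sensitivity(parity, n))      # 1.0 1.0
```

Both functions have the maximum label entropy of 1 bit, yet the dictator's sensitivity is 1/n while Parity's is 1, which is why entropy alone cannot separate them.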

B.2 Additional Sensitivity Results
The distribution of the sensitivity of Transformers and LSTMs initialized with the Xavier uniform distribution is given in Figure 10. For Gaussian initialization, the weights are sampled with mean 0 and σ = 10. For Xavier uniform initialization, all the values in the weight matrices are sampled uniformly between $-d^{1/2}$ and $d^{1/2}$, where d is the number of hidden units. The values in the bias vectors are set to zero and the ones in the input embedding vectors are sampled from N(0, 1).
For LSTMs with uniform sampling, the change in sensitivity across different widths (hidden_size) and lengths is provided in Figure 9. As can be seen, the sensitivity of LSTMs does not significantly reduce across higher lengths and widths, unlike Transformers.

C Sensitivity During Learning Sentiment Classification
In this section, we discuss experiments on measuring the sensitivity of Transformers and LSTMs when trained on the sentiment classification task.

C.1 Experimental Setup
Datasets. We experiment with two sentiment classification datasets: SST (Socher et al., 2013) and IMDB (Maas et al., 2011). For SST, we train on the full train set of size 67349 examples and evaluate both sensitivity and validation accuracy on the validation set of size 872 examples. For IMDB, we preprocess the dataset to only include sentences of length up to 500. This leads to a train set of size 22156. The validation set consists of 8939 examples randomly sampled from the test set. Since the sentences in IMDB dataset are of much longer lengths, in order to save compute, we evaluate sensitivity of models on a dataset of size 894 examples randomly sampled from the test set.
Sensitivity Metrics. Boolean sensitivity as defined in Section 3 cannot be directly applied to sequences of variable length and larger vocabulary. As an alternative, we compute certain proxy metrics which measure how likely it is for the function value to change due to a change in one token of the input sequence. To that end, we design three simple metrics to measure the sensitivity of models trained on sentiment classification: 1. Word Label-Sensitivity: For each word in the sentence (one word at a time), we replace it n times with a word sampled randomly from the vocabulary and measure the average (over n) number of times the predicted label changes. We sum this value for all the words in the sentence and normalize the value by its length.
2. Word Softmax-Sensitivity: For each word in the sentence (one word at a time), we replace it n times with a word sampled randomly from the vocabulary and measure the average (over n) L2-distance between the predicted softmax normalized output vector before and after the replacement. Again, we sum this value for all the words in the sentence and normalize by its length.
3. Embedding Label-Sensitivity: For each word in the sentence (one word at a time), we add Gaussian noise with mean 0 and variance $\sigma^2$ to its embedding n different times and measure the average (over n) number of times the predicted label changes. We sum this value for all the words in the sentence and normalize by its length.
For all metrics, the final score is obtained by averaging across all the examples in the dataset. In all our experiments, we set n = 10 and $\sigma^2 = 15$.
Hyperparameter Details. For both Transformers and LSTMs, we vary the number of layers ∈ {1, 2}, learning rate ∈ {0.0001, 0.0003, 0.0005}, and model width (d_model/hidden_size) ∈ {128, 256}. We set the batch size as 128 and the FFN size as twice the width. For LSTMs, we keep the embedding size the same as the hidden size. Both models are trained with the Adam optimizer (Kingma and Ba, 2014) and Dropout regularization with probability 0.2.
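The word label-sensitivity metric defined above can be sketched in a few lines. This is an illustration under assumptions rather than the authors' implementation: `predict` is a hypothetical callable that maps a list of tokens to a predicted label, `vocab` is a hypothetical list of replacement tokens, and sentences are assumed to be non-empty.

```python
import random

def word_label_sensitivity(predict, sentences, vocab, n=10, seed=0):
    """Average word label-sensitivity of `predict` over `sentences`.

    predict: hypothetical callable, list of tokens -> predicted label.
    sentences: list of non-empty token lists.
    vocab: list of tokens to sample replacements from.
    n: number of random replacements per position.
    """
    rng = random.Random(seed)
    scores = []
    for tokens in sentences:
        base = predict(tokens)
        per_sentence = 0.0
        for i in range(len(tokens)):
            changes = 0
            for _ in range(n):
                perturbed = list(tokens)
                perturbed[i] = rng.choice(vocab)  # replace one word
                changes += predict(perturbed) != base
            per_sentence += changes / n           # average over n replacements
        scores.append(per_sentence / len(tokens))  # normalize by length
    return sum(scores) / len(scores)               # average over the dataset
```

The softmax variant would follow the same loop, with the label comparison replaced by the L2-distance between the output distributions before and after the replacement.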
Results. Figure 12 shows the word softmax-sensitivity for Transformers and LSTMs across different iterations of training for SST and IMDB datasets. The word label-sensitivity and embedding label-sensitivity for the SST dataset are provided in Figure 13. We find that across all three measures, both Transformers and LSTMs learn functions of increasing sensitivity where they prioritize learning functions of lower sensitivity first. We found 'word label-sensitivity' and 'word softmax-sensitivity' to correlate well with generalization gap (i.e., the difference between train accuracy and test accuracy). Since the measures are very similar, there is a strong correlation between the two measures themselves. We did not find any non-trivial correlation between 'embedding label-sensitivity' and generalization gap. Note that unlike random and sparse Boolean functions, on real datasets, we did not find Transformers converging to functions with lower sensitivity.

D.1 Sensitivity as Capacity Measure
We provide a simple illustration to show how maximum sensitivity can be used as a capacity measure to derive generalization bounds. Capacity measures such as the VC Dimension are a classical approach to derive sample complexities and probabilistic upper bounds for the test error of a classification model. Let $\mathcal{F}_k$ be a class of functions $f : \{0, 1\}^n \to \{\pm 1\}$ such that the maximum sensitivity of any function $f \in \mathcal{F}_k$ is upper bounded by $k$, where $0 \le k \le n$. Any function $f$ with maximum sensitivity $k$ is uniquely determined by its values on any Hamming ball of radius $2k$ in $\{0, 1\}^n$ (Gopalan et al., 2016). This can be used to upper bound the size of the function class: $|\mathcal{F}_k| \le 2^{\binom{n}{\le 2k}}$, where $\binom{n}{\le 2k} = \sum_{i=0}^{2k} \binom{n}{i}$ is the number of points in such a ball. Since the VC Dimension (denoted VCD) of a class of functions $\mathcal{F}$ is upper bounded by $\log |\mathcal{F}|$, we have that
$$\mathrm{VCD}(\mathcal{F}_k) \le \log |\mathcal{F}_k| \le \binom{n}{\le 2k}.$$
Let $f \in \mathcal{F}_k$ be a target function and $\hat{f} \in \mathcal{F}_k$ be a hypothesis produced by a learning algorithm. Let $L(\hat{f}, f) =$ [...]
Proposition D.1. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for any functions $f, \hat{f} \in \mathcal{F}_k$: [...] where $c > 0$ is some constant.
Functions with low maximum sensitivity can be learned with better sample efficiency. Functions with low average sensitivity can also be learned efficiently when the data generating distribution is uniformly distributed over the input (O'Donnell (2021), Sec 3.4).
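To get a feel for the counting bound on the class of functions with maximum sensitivity at most k, the sketch below evaluates the size of a Hamming ball of radius 2k, which upper bounds the base-2 logarithm of the class size and hence its VC dimension. The function names and the example value n = 50 are assumptions chosen for illustration.

```python
import math

def hamming_ball_size(n, radius):
    """Number of points of {0,1}^n within Hamming distance `radius` of a fixed point."""
    return sum(math.comb(n, i) for i in range(min(radius, n) + 1))

def log2_class_size_bound(n, k):
    """Upper bound on log2 of the number of functions on n bits with
    maximum sensitivity at most k (and hence on their VC dimension)."""
    return hamming_ball_size(n, 2 * k)

n = 50
for k in (1, 2, 5):
    # Compare with 2^n, the log2-size of the class of all Boolean functions.
    print(k, log2_class_size_bound(n, k), 2 ** n)
```

For small k the bound grows polynomially in n, whereas the unrestricted class has $2^n$ as its log-size, which is the sense in which low maximum sensitivity restricts capacity.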

D.2 Sensitivity and Generalization Gap
The correlation between sensitivity and generalization has previously been studied for networks trained on Boolean functions (Franco, 2006) and image datasets (Novak et al., 2018). We examine the relation between simple variants of sensitivity described in Section C and generalization.
We train various models on SST dataset until convergence and then compare sensitivity with generalization gap. The generalization gap is simply the difference between the train error and test error; higher gap indicates overfitting. We plot the word label-sensitivity and word softmax-sensitivity (defined in Section C) for Transformers, LSTMs, and a pretrained Large Language Model (RoBERTa (Liu et al., 2019)) against the generalization gap (see Figure 14). We observe positive correlation between the measures and generalization gap indicating that when sensitivity is higher, the models are more likely to overfit and achieve poorer generalization performance. Large language models such as RoBERTa have a lower sensitivity while achieving better test accuracies than Transformers and LSTMs trained from scratch.

E Additional Experiments on Parities
Standard Parity. The training curves for LSTMs on standard parity are provided in Figure 15. The models are trained on datasets of size 20k where the input strings are of length 30. Similar to Transformers on Sparse Parities, we observe phase transitions for LSTMs on standard Parity task.
Sparse Parities. The results on sparse parities with length n=40 and k=4 relevant bits for Transformers with absolute positional encodings are provided in Figure 16. We find that Transformers with absolute positional encodings are able to generalize well on Sparse Parities task and exhibit grokking on relatively larger datasets (30k samples) in comparison to models with learnable positional embeddings. For Transformers trained with learnable encoding, we robustly observe grokking on small datasets. Figure 17 depicts the training curves for Transformers trained on datasets of size 5k.
Overfitting. We found that Transformers overfit on training data when the sample size is too low. Apart from that, for datasets of certain sizes, we find that while Transformers with depth up to 6 generalize well, those with much higher depths (> 8) overfit across several runs.
Mixed Parity. To explore the difference in bias between Transformers and LSTMs, we conduct a simple experiment described as follows. We create a dataset of size 15k called 'Mixed Parity' where half of the examples are labelled as standard Parity (label is determined by all bits) and the other half is labelled as Sparse Parities with 4 relevant bits. The inputs are of length 30 and the first bit determines whether the input is labelled according to the standard Parity function (when the first bit is 1) or as a Sparse Parities function (when the first bit is 0). We train Transformers and LSTMs (with learnable positional encodings) of depth 2 and width 64 across various learning rates ∈ [0.01, 0.00001] on the Mixed Parity dataset. We find that LSTMs obtain 100% training accuracy on the dataset (see Figure 19, right); their validation accuracy on the Parity task is near 100% whereas it is 50% on the Sparse Parities task. In contrast, the training accuracy of Transformers converges around 75% (see Figure 19, left); their validation accuracy on the Parity task is 50% whereas on Sparse Parities they achieve near 100% validation accuracy.
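The Mixed Parity construction can be sketched as follows. The function name and defaults are assumptions for illustration, as are two details the text leaves open: inputs are sampled uniformly (so roughly, not exactly, half fall under each rule), and the relevant bits of the sparse parity exclude the selector bit at position 0.

```python
import random

def make_mixed_parity(size=15000, n=30, k=4, seed=0):
    """Build a Mixed Parity dataset: the first bit selects the labelling rule."""
    rng = random.Random(seed)
    # Assumption: relevant bits are drawn from positions 1..n-1, so the
    # selector bit at position 0 is not itself a relevant bit.
    relevant = sorted(rng.sample(range(1, n), k))
    data = []
    for _ in range(size):
        x = [rng.randint(0, 1) for _ in range(n)]
        if x[0] == 1:
            y = sum(x) % 2                       # standard Parity over all n bits
        else:
            y = sum(x[i] for i in relevant) % 2  # Sparse Parity over k bits
        data.append((x, y))
    return data, relevant

data, relevant = make_mixed_parity()
```

Under this construction, a model that fits the sparse half perfectly and guesses at chance on the standard-parity half would score about 0.5 × 1.0 + 0.5 × 0.5 = 0.75, which is consistent with the roughly 75% training accuracy reported for Transformers.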
Convergence time vs Sample Size. For Transformers trained on Sparse Parities, we conduct experiments to compare the number of computational steps required to successfully learn Sparse Parities with the size of the dataset it is trained on. We consider length n=40 and k=4, and create datasets of five different sizes ({5k, 25k, 50k, 100k and 250k}). For each dataset, we train a Transformer of depth 2 and width 128 across 100 different initializations with learning rate ∈ {0.0001, 0.0005} and with batch size 500. We consider each iteration as a computational step. We report the median, minimum and maximum steps for each dataset in Figure 20. We find that neural networks such as Transformers can successfully learn Sparse Parities with relatively small number of computational steps on small-sized training sets. It is perhaps surprising that for Sparse Parities with n=40 and k=4, Transformers can successfully generalize well with less than 20000 computational steps on over 75 out of 100 runs.

E.1 Experiments on Variable Length Inputs
We conducted some additional experiments on tasks with variable length inputs. These tasks are simple extensions of sparse parities and majorities to variable length input and have (in an informal sense) low sensitivity.
Task. Let VarParity-(n, k) denote the extension of Parity-(n, k) to variable length sequences. A function in VarParity-(n, k) is defined over sequences over {0, 1, 2} in which the total number of 0s and 1s is exactly n, along with k relevant indices which determine the label. The input distribution is such that the token 2 may appear between any zeros and ones with some probability. The tokens 2, however, do not influence the output of the function and are included only to vary the input lengths. The label is determined by removing all the tokens 2 from the input and applying the regular Parity-(n, k) to the remaining string over $\{0, 1\}^n$.
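A sampler for this task could look like the sketch below. The padding distribution is not specified in the text, so the geometric number of 2s emitted before each bit, the parameter `pad_prob`, and the function names are all assumptions made for illustration.

```python
import random

def sample_var_parity(n, relevant, pad_prob=0.3, rng=random):
    """Sample one (sequence, label) pair for a VarParity-(n, k) function.

    relevant: indices (into the underlying n-bit string) of the k relevant bits.
    pad_prob: hypothetical probability of emitting one more padding token 2;
              must be below 1 so that the padding loop terminates.
    """
    bits = [rng.randint(0, 1) for _ in range(n)]
    sequence = []
    for b in bits:
        while rng.random() < pad_prob:  # geometric number of 2s before each bit
            sequence.append(2)
        sequence.append(b)
    label = sum(bits[i] for i in relevant) % 2
    return sequence, label

def var_parity_label(sequence, relevant):
    """Recover the label by dropping every token 2 and applying Parity-(n, k)."""
    bits = [t for t in sequence if t != 2]
    return sum(bits[i] for i in relevant) % 2
```

Because the relevant indices refer to positions in the string after the 2s are removed, their positions in the raw sequence shift from sample to sample, unlike in the fixed length setting.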
Results. Contrary to the fixed length setting, we observe that both Transformers and LSTMs perform similarly on these tasks. For both VarParity-(n, k) and VarMaj-(n, k) we experiment with various mean lengths and variances with k = 5. The general behaviour is that both Transformers and LSTMs generalize well when the tasks are over short sequences (< 40 for VarParity-(n, k) and < 100 for VarMaj-(n, k)). However, as the lengths of the input go beyond that, both architectures do not generalize well.
In comparison to LSTMs, Transformers only performed better when the variance of the lengths of the inputs was very low. An interesting observation about Transformers is that they only seemed to generalize well with positional masking (also referred to as causal masking) along with positional encodings. Their performance was notably worse with only positional encodings (learnable or absolute).
These results do not support the hypothesis posed in Section 1 and we intend to explore this further in the future.
[Figure caption, beginning truncated: The curve depicts the median number (along with min/max) of iterations with a batch size of 500 across 50-100 successful runs for each dataset size. Refer to Section E for details.]

F Fitting Randomly Labelled Data
We conduct some experiments to examine the ability of LSTMs and Transformers to fit random noise. The capacity of a class of functions to fit random noise is often theoretically measured as its Rademacher complexity. Given the incredible expressive power of neural networks, measures such as Rademacher complexity lead to vacuous generalization bounds. One assumption was that, despite their capacity, deep neural networks trained with gradient-based methods can only learn a small subset of such functions. The work of Zhang et al. (2021) demonstrated that large feedforward-like networks trained with gradient-based methods are able to fit random noise on image datasets. We conduct similar experiments to evaluate the ability of sequence models to fit noise on text data. We consider the SST dataset (Socher et al., 2013) as used in the GLUE benchmark. The training data contains approximately 65k samples and we label each sample either +1 or −1 randomly (with probability 1/2 each). Figure 21 depicts the training curves for Transformers and LSTMs. We find that both models are able to fit the training set near-perfectly. For both models, training takes significantly more iterations/epochs than training on the original dataset with true labels, which only takes a few epochs.

G Implementation Details
Our implementation of Transformer is based on Rush (2018). For various recurrent models such as RNNs, GRUs, and LSTMs, we use PyTorch's standard implementation (Paszke et al., 2019).
All our experiments were conducted using 16 NVIDIA Tesla V100 GPUs each with 16GB memory. For each dataset, we extensively tune across several hyperparameters and report the results based on the best-performing models (or average of top 5 runs for best performing hyperparameter). Table 1 lists the hyperparameters used for tuning the models for Boolean function experiments in Section 5. We use a grid search procedure to tune the hyperparameters. For all our results, we used the Adam optimizer and tuned the learning rates. We also tried SGD with weight decay but could not get either Transformers or LSTMs to perform well on parities, sparse parities, or random k-sparse functions.