Pretraining Without Attention

Transformers have been essential to pretraining success in NLP. While other architectures have been used, downstream accuracy is either significantly worse, or requires attention layers to match standard benchmarks such as GLUE. This work explores pretraining without attention by using recent advances in sequence routing based on state-space models (SSMs). Our proposed model, Bidirectional Gated SSM (BiGS), combines SSM layers with a multiplicative gating architecture that has been effective in simplified sequence modeling architectures. The model learns static layers that do not consider pair-wise interactions. Even so, BiGS is able to match BERT pretraining accuracy on GLUE and can be extended to long-form pretraining of 4096 tokens without approximation. Analysis shows that while the models have similar average accuracy, the approach has different inductive biases than BERT in terms of interactions and syntactic representations. All models from this work are available at https://github.com/jxiw/BiGS.


Introduction
Transformers are the de facto model architecture for NLP pretraining (Vaswani et al., 2017). Since BERT (Devlin et al., 2018), they have proven central to NLP tasks with their ability to learn effectively on large unlabeled datasets. Specifically, the use of attention as a central routing component seems critical to empirical success on downstream tasks. Other architectures have been proposed but require attention layers for high accuracy (Tay et al., 2020b; Lee-Thorp et al., 2021).
Is the centrality of attention in pretraining due to inductive bias or computational convenience? This question is complicated by the properties of common sequence routing layers: recurrent neural network (RNN) models do not scale as well as attention, whereas convolutional neural networks (CNNs) cannot easily model long-distance dependencies.
State-space models (SSMs) for deep learning provide a promising alternative. Recent works show that SSMs are a competitive architecture for long-range sequence modeling (Gu et al., 2021). SSMs achieve strong results on speech generation (Goel et al., 2022), and on the Long Range Arena benchmark (Tay et al., 2020a) they outperform standard and long-range transformer architectures (Gu et al., 2021; Gupta, 2022; Gu et al., 2022; Smith et al., 2022). In addition to improving accuracy, SSM-based routing does not have quadratic complexity as the length of the sequence grows. Concretely, the model provides a way to achieve RNN-like long-range dependencies with CNN-like training speed.

This work proposes an architecture for applying SSMs to BERT-style pretraining: the Bidirectional Gated SSM (BiGS) model. BiGS uses SSM routing at its core as a replacement for attention. However, this change alone significantly degrades the representational capacity of the model. To target this issue, we develop a multiplicative gating architecture (Dauphin et al., 2017; Hua et al., 2022; Mehta et al., 2022). In combination, this leads to a simpler routing approach that remains surprisingly effective at modeling the necessary interactions.
Experiments compare SSMs to standard NLP pretraining. While we find that SSMs by themselves underperform on NLP pretraining tasks, BiGS is able to match the performance of a BERT model when trained on the same data in a controlled setting. By additionally pretraining on longer-length instances, the model extends without approximation to input sequences of length 4,096. Analysis shows the importance of multiplicative gating in fixing specific issues of variable-length textual input. All models from this work will be available open-source (Apache 2.0 license) upon release.

Related Work
Prior to BERT, promising approaches for pretraining contextual representations used RNN-based models (McCann et al., 2017; Peters et al., 2018). While important precursors, their accuracy did not scale with data or compute as well as Transformers. This gap remains even when back-porting best practices from Transformer pretraining (Peters et al., 2019). Recently, Tay et al. (2021) explored pretraining with several convolutional (CNN) variants. Results show that CNNs without attention do not perform well, although they note benefits in routing speed. Lee-Thorp et al. (2021) propose FNet, which replaces the attention layer with a Fourier transform. Without attention, this achieves 92-97% of BERT accuracy on GLUE (Wang et al., 2018). Other works have used CNN-based models with multiplicative gating for NLP tasks such as machine translation (Dauphin et al., 2017). We believe BiGS is the first model to achieve BERT-level transfer learning on the GLUE benchmark without attention.
Researchers have begun to use state-space models for NLP tasks, primarily focusing on autoregressive language modeling. S4 (Gu et al., 2021) and its variants (Gupta, 2022; Gu et al., 2022) experimented with language modeling, achieving promising results, though slightly worse than transformers. Gated State Space adapts an SSM-plus-gating approach to language modeling (Mehta et al., 2022). Concurrent to this work, Dao et al. (2022b) propose H3, which closes the gap in autoregressive language modeling and, with two attention layers, outperforms transformers on OpenWebText. Finally, a related method, MEGA (Ma et al., 2022), combines exponential moving average routing with a simple attention unit to outperform transformer baselines. Our approach instead focuses on bidirectional masked language modeling and questions of downstream generalization.

State Space Models
A state space model (SSM) is a general-purpose tool for describing the relationship between a continuous-time scalar input u(t) and a scalar output y(t) by the following differential equations: x'(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t).
Figure 1: An SSM learns a one-dimensional kernel K, which is convolved with the input sequence u to produce the output y. Unlike attention, routing is static and does not depend on the input. In BiGS, we use only two kernels per layer (forward and backward). Figure 3 shows all the kernels used in the fully trained model.
Here x(t) ∈ R^N is a continuous-time state vector, x'(t) is its derivative, and the equation is parameterized by the matrices A, B, C, D. When applied to a discrete-time scalar input sequence u_1, ..., u_L, the SSM equations and parameters can be discretized, leading to the following recursion:

$$x_k = \overline{A}\,x_{k-1} + \overline{B}\,u_k, \qquad y_k = \overline{C}\,x_k + \overline{D}\,u_k,$$

where $\overline{A}, \overline{B}, \overline{C}, \overline{D}$ are functions of the original parameters and a discretization rate.

This equation can be computed like an RNN, where x_k ∈ R^N is a hidden state at time k. Unlike an RNN, though, the linearity of the recursion allows y_1, ..., y_L to be computed directly as a convolution with a precomputed kernel:

$$K = \big(\overline{C}\,\overline{B},\ \overline{C}\,\overline{A}\,\overline{B},\ \ldots,\ \overline{C}\,\overline{A}^{L-1}\overline{B}\big), \qquad y = K * u.$$

The process is illustrated in Figure 1. In a practical sense, after training, this kernel K fully characterizes the SSM, i.e. the model is a 1D convolution with a very long kernel.
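To make the convolutional view concrete, the following sketch (not the actual BiGS or S4 code) discretizes a small random SSM with a bilinear rule, materializes the kernel K, and checks that convolving with K matches the step-by-step recurrence. The matrices, the discretization step, and the omission of the D feedthrough term are illustrative choices for the example.

```python
# Minimal NumPy sketch of the SSM-as-convolution view described above.
import numpy as np

def discretize(A, B, dt):
    """Bilinear (Tustin) discretization of the continuous-time SSM."""
    N = A.shape[0]
    I = np.eye(N)
    left = np.linalg.inv(I - dt / 2 * A)
    Ab = left @ (I + dt / 2 * A)
    Bb = left @ (dt * B)
    return Ab, Bb

def ssm_kernel(Ab, Bb, C, L):
    """K = (C Bb, C Ab Bb, ..., C Ab^{L-1} Bb): the 1D convolution kernel."""
    K, x = [], Bb
    for _ in range(L):
        K.append((C @ x).item())
        x = Ab @ x
    return np.array(K)

def ssm_conv(K, u):
    """Causal convolution y_k = sum_j K_j u_{k-j} (D feedthrough omitted)."""
    return np.convolve(u, K)[: len(u)]

def ssm_recurrence(Ab, Bb, C, u):
    """Reference RNN-style computation of the same recursion."""
    x, ys = np.zeros((Ab.shape[0], 1)), []
    for uk in u:
        x = Ab @ x + Bb * uk
        ys.append((C @ x).item())
    return np.array(ys)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, L = 4, 16
    A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))  # toy transition matrix
    B = rng.standard_normal((N, 1))
    C = rng.standard_normal((1, N))
    u = rng.standard_normal(L)

    Ab, Bb = discretize(A, B, dt=0.1)
    K = ssm_kernel(Ab, Bb, C, L)
    # The convolution with K reproduces the recurrent computation exactly.
    assert np.allclose(ssm_conv(K, u), ssm_recurrence(Ab, Bb, C, u))
```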

Learning SSMs
Gu et al. (2020, 2021) demonstrate an effective approach for using SSMs in neural networks. The core insight is an initialization of the transition matrix A, known as HiPPO:

$$A_{nk} = -\begin{cases} (2n+1)^{1/2}(2k+1)^{1/2} & \text{if } n > k \\ n+1 & \text{if } n = k \\ 0 & \text{if } n < k. \end{cases}$$

This matrix yields a stable training regime and can be trained efficiently. The full model, S4, retains the SSM's ability to model long-range sequences while being more efficient than RNNs to train. Recently, researchers (Gu et al., 2022; Gupta, 2022) have proposed simplified diagonalized versions of S4, which achieve comparable results with a simpler approximation of the original parameterization. In preliminary experiments, we used several different S4 parameterizations but did not find a significant difference in accuracy. Throughout this work, we use S4D as the parameterization.

Figure 2: The STACK and GATED architectures considered in this work (Mehta et al., 2022; Hua et al., 2022). Different from Mehta et al. (2022), we avoid reducing the model dimension of the SSM. For the Routing component (dashed lines), we consider both a bidirectional SSM (shown) and standard self-attention. The gate (⊗) represents element-wise multiplication. The BiGS model uses GATED with SSM.
While the specifics of SSM discretization, parameterizations, and training are beyond the scope of this work, at a high level we note that each variant of SSM leads to a similar convolution form. The model can therefore be trained by backpropagation through the convolution without the serial bottleneck of RNNs, and applied without the quadratic cost of attention.
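As one illustration of this shared convolution form, the sketch below computes the kernel of a diagonal SSM in the spirit of S4D, where the kernel reduces to a Vandermonde matrix-vector product. The zero-order-hold discretization and the toy initialization are assumptions for the example, not the exact S4D recipe.

```python
# Hedged sketch: convolution kernel of a diagonal SSM (S4D-style).
import numpy as np

def diagonal_ssm_kernel(Lambda, B, C, dt, L):
    """With diagonal A = diag(Lambda), the kernel is
    K[l] = sum_n C[n] * Lambda_bar[n]**l * B_bar[n],
    i.e. a Vandermonde matrix-vector product over the discrete poles."""
    Lambda_bar = np.exp(Lambda * dt)              # zero-order-hold poles
    B_bar = (Lambda_bar - 1.0) / Lambda * B       # zero-order-hold input scaling
    powers = Lambda_bar[:, None] ** np.arange(L)  # (N, L) Vandermonde matrix
    K = (C * B_bar) @ powers                      # (L,) complex kernel
    # Take the real part; S4D ensures a real kernel via a
    # conjugate-symmetric parameterization, which this toy omits.
    return K.real

if __name__ == "__main__":
    N, L = 32, 128
    n = np.arange(N)
    Lambda = -0.5 + 1j * np.pi * n                # toy diagonal initialization
    B = np.ones(N, dtype=np.complex128)
    C = np.random.default_rng(0).standard_normal(N) + 0j
    K = diagonal_ssm_kernel(Lambda, B, C, dt=1.0 / L, L=L)
    print(K.shape)                                # (128,)
```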

Multiplicative Gating
Gating units have been widely used to improve the performance of various architectures such as MLPs, CNNs, and Transformers (Dauphin et al., 2017; Shazeer, 2020; Narang et al., 2021). One example of such a gating unit is the Gated Linear Unit (GLU), which has been used effectively for CNN-based NLP systems (Dauphin et al., 2017). Let u represent an input activation. GLU first computes both a gating vector and a linear transform, σ(Wu) and Vu respectively. The output of the layer is then the element-wise product σ(Wu) ⊗ (Vu).
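A minimal sketch of the GLU computation just described; the projection shapes and scaling below are arbitrary illustrative choices.

```python
# GLU(u) = sigmoid(W u) ⊗ (V u): gate branch times linear branch.
import torch

def glu(u, W, V):
    return torch.sigmoid(u @ W) * (u @ V)

u = torch.randn(8, 64)        # (batch, d) input activations
W = torch.randn(64, 64) / 8   # gate projection
V = torch.randn(64, 64) / 8   # value projection
out = glu(u, W, V)            # (8, 64)
```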
Recent work has shown that gating can increase the performance of models using simplified routing. Hua et al. (2022) show that linear-time attention models can benefit from improved gating. Mehta et al. (2022) propose a Gated State Space architecture using gating for unidirectional SSM models. Multiplicative gating may restore some of the interaction capacity of full attention-based routing.

BiGS Model
We consider two different architectures for SSM pretraining: a stacked architecture (STACK) and a multiplicative gated architecture (GATED) shown in Figure 2.
Transformer Architecture The STACK architecture with self-attention is equivalent to the BERT / transformer model. We replace the attention block with two sequential SSM blocks to mimic the nature of bidirectional self-attention.

Gated Architecture
The GATED architecture is a bidirectional adaptation of the gated unit of Hua et al. (2022). Specifically, let X_i ∈ R^{L×d} be the activations at the i-th layer, where L is the sequence length and d is the model size. We use GELU (Hendrycks and Gimpel, 2016) as the activation σ. The first stage computes projections of the input X_i through σ. The second stage uses two sequential SSM blocks (i.e., a forward and a backward SSM layer) combined with a multiplicative gate.
The third stage uses a feed-forward layer, again with gating, to replace the two dense blocks of the traditional transformer architecture. Finally, we sum this output O with the original input X_i to form the input X_{i+1} of the next layer i + 1.
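The following is a rough PyTorch sketch of the three-stage gated block described above, written under assumptions about the exact layout: the projection names (`u_proj`, `v_proj`, `fwd_out`, `bwd_out`, `ff`) and the fixed toy kernels standing in for the learned forward and backward SSMs are hypothetical, and the real BiGS block may place its projections, GELUs, and normalization differently.

```python
# Rough sketch of a gated bidirectional SSM block (not the exact BiGS layout).
import torch
import torch.nn as nn
import torch.nn.functional as F

def causal_conv(x, kernel):
    """Apply one length-L kernel causally to (B, L, d) input,
    sharing the same kernel across all d channels."""
    B, L, d = x.shape
    k = kernel[:L].flip(0).view(1, 1, -1)               # conv1d is cross-correlation
    xt = x.transpose(1, 2).reshape(B * d, 1, L)
    y = F.conv1d(F.pad(xt, (L - 1, 0)), k)              # left-pad for causality
    return y.reshape(B, d, L).transpose(1, 2)

class GatedSSMBlock(nn.Module):
    def __init__(self, d, L):
        super().__init__()
        self.u_proj = nn.Linear(d, d)    # stage 1: gate branch
        self.v_proj = nn.Linear(d, d)    # stage 1: value branch
        self.fwd_out = nn.Linear(d, d)   # stage 2: after forward SSM
        self.bwd_out = nn.Linear(d, d)   # stage 2: after backward SSM
        self.ff = nn.Linear(d, d)        # stage 3: gated feed-forward
        # Stand-ins for the learned forward / backward SSM kernels.
        self.k_fwd = nn.Parameter(torch.randn(L) / L)
        self.k_bwd = nn.Parameter(torch.randn(L) / L)

    def forward(self, x):                                     # x: (B, L, d)
        u = F.gelu(self.u_proj(x))                            # stage 1
        v = F.gelu(self.v_proj(x))
        fwd = causal_conv(v, self.k_fwd)                      # stage 2: forward SSM
        bwd = causal_conv(v.flip(1), self.k_bwd).flip(1)      #          backward SSM
        routed = F.gelu(self.fwd_out(fwd)) * F.gelu(self.bwd_out(bwd))
        o = self.ff(u * routed)                               # stage 3: gate + feed-forward
        return x + o                                          # residual into next layer

block = GatedSSMBlock(d=64, L=128)
print(block(torch.randn(2, 128, 64)).shape)                   # torch.Size([2, 128, 64])
```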
The number of parameters per layer in the gated SSM is roughly 13d^2, while the number of parameters per layer in the stack is 12d^2. We compensate for this difference by using fewer gated layers.
Different from Mehta et al. (2022), we find that the hidden dimension size of the SSM layers is critical. Reducing that hidden dimension results in a notable degradation in MLM performance (−0.67) in our 11B (short) training setting.

SSM Layer
The SSM layer under both architectures is a map over vector sequences, SSM(X) : R^{L×d} → R^{L×d}. However, the SSM above is defined over scalar sequences. Past work creates d differently parameterized SSMs, one for each hidden dimension (Gu et al., 2021). Experimentally, though, we found it just as practical to use the same parameterization (and therefore the same kernel K) for each hidden dimension. This simplifies model analysis and makes the total number of SSM parameters negligible.
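The sketch below illustrates this simplification under assumed shapes (it is not the project's actual implementation): one shared length-L kernel per direction is broadcast over all d hidden dimensions via an FFT convolution, so SSM routing contributes only on the order of 2L values per layer.

```python
# One shared kernel applied to every hidden dimension of a (L, d) activation.
import numpy as np

def shared_kernel_conv(X, K):
    """X: (L, d) activations, K: (L,) shared kernel -> (L, d) outputs."""
    L, d = X.shape
    n = 2 * L                                  # pad to avoid circular wrap-around
    Kf = np.fft.rfft(K, n=n)[:, None]          # (n//2+1, 1)
    Xf = np.fft.rfft(X, n=n, axis=0)           # (n//2+1, d), all channels at once
    return np.fft.irfft(Kf * Xf, n=n, axis=0)[:L]

L, d = 128, 768
X = np.random.default_rng(0).standard_normal((L, d))
K = np.random.default_rng(1).standard_normal(L) / L
print(shared_kernel_conv(X, K).shape)          # (128, 768)
# Static routing values per layer: 2 * L (forward + backward kernels),
# versus the L x L attention coefficients a transformer head materializes.
```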
compared to using the [CLS] token as the sentence representation.
Our SSM implementation is based on the Annotated S4 (Rush, 2022), and our pretraining uses the template from Hugging Face Transformers (Wolf et al., 2020). We experimented with variants of SSMs and found they performed similarly; experiments use S4D (Gu et al., 2022) for simplicity. Note that, for a fair comparison, we keep the size of the gated architecture comparable to the stacked architecture and our BERT implementation.

GLUE
Table 1 (Top) shows the main results for different pretrained models on the GLUE benchmark. In short and medium training, we note that the STACK architecture is significantly better with attention than with SSM routing. However, with the GATED architecture, the SSM achieves competitive results. To confirm this is not simply the effect of a better architecture, we also try gating with attention but find it does not improve results. On full training, BiGS continues to improve in accuracy.

Table 1 (Bottom) compares the BiGS architecture to other reported results on GLUE. First, we compare to other non-attention-based pretrained models based on RNNs and CNNs (Peters et al., 2019; Tay et al., 2021; Lee-Thorp et al., 2021). Results from these works all show significant degradation in transfer learning, with GLUE scores far below BERT. Next, we compare BiGS to the full BERT results as reported in past work, both from the original paper (Devlin et al., 2018) and from follow-up works with an improved fine-tuning convention (Izsak et al., 2021). We see that the BiGS model achieves comparable test scores. While the final GLUE score is nearly identical, we do see that the models perform differently on the underlying tasks, which we explore more below.
We also apply BiGS to SQuAD (Rajpurkar et al., 2016). SQuAD requires extending the length of the model from 128 to 512 tokens through additional training. We report the F1 score in Table 2. We see that BiGS outperforms BERT when adapted with this procedure (Wettig et al., 2022). We note that both of these results underperform the original BERT SQuAD results.

Long-Form Classification
An advantage of SSM-based routing is that models can extend to longer ranges without requiring approximation. To adapt to longer-range classification, we continue pretraining on longer data (length 4,096). Table 3 shows results on the encoder-only experiments in SCROLLS (Shaham et al., 2022), a recent long-range language modeling benchmark. We compare the model to the Longformer Encoder-Decoder (LED) and BART. On these long-range tasks, BiGS performs as well as or better than these baselines, taking advantage of the long-range context.

Role of SSM
Compared to multi-head attention, where routing is determined by L^2 attention coefficients per head per layer, BiGS SSM routing is relatively compact. Each layer has only 2L static values in K. Figure 3 shows these values in the form of the forward and backward kernels. These kernels correspond partially to local aggregations, such as the previous word (layer 1) or a subsequent trigram (layer 6), and partially to long-term future or past information (layers 14 and 17).
Figure 4 shows how these kernels change during fine-tuning. In particular, during MNLI fine-tuning, the model needs to look at more long-distance information to match between sentences. This results in most local kernels remaining the same while long-distance kernels adjust. The figure shows three kernels expanding their scope outward.

Role of Gating
GLUE results show a significant improvement in downstream accuracy with the GATED model; however, we actually find that the worse STACK SSM model has a similar pretraining MLM loss. Figure 5 illustrates the difference in MLM loss and MNLI accuracy for both the GATED and STACK SSM, compared to the MLM loss and expected MNLI values reported for BERT (Devlin et al., 2018). The figure shows that for the GATED model downstream accuracy tracks MLM loss, while for STACK it does not. We speculate that multiplicative gating helps the SSM model recover some of the generalization ability of attention, particularly for handling long sequences. For example, Figure 6 compares the accuracy of examples binned by length on the QNLI task. We see that the GATED SSM maintains accuracy as examples get longer and the required dependencies move further apart.

Table 4: FLOP comparison between BiGS and BERT with respect to input token length. We calculate FLOPs with a batch size of 1 and consider both the forward and backward passes.

Efficiency Analysis
A benefit of BiGS is the ability to scale to much longer sequences without a quadratic increase in floating point operations (FLOPs). In Table 4, we compare the theoretical FLOPs of BiGS and BERT for different input token lengths to better understand their relative scalability. At lengths up to 512, the cost of both models is dominated by the feed-forward networks, but when growing beyond 1,024 tokens, the BiGS approach has a significant FLOP advantage over attention. This efficiency gap grows nonlinearly with token length between 1,024 and 4,096, implying that BiGS is better equipped to handle applications with longer input sequences.
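As a back-of-the-envelope check on this scaling argument (not the exact accounting behind Table 4), the sketch below compares attention routing cost to feed-forward cost for a BERT-Large-like layer; the 2×MACs convention and layer sizes are assumptions.

```python
# Rough per-layer FLOP estimates: attention routing vs. feed-forward block.
def attention_routing_flops(L, d):
    # QKV/output projections plus the two L x L matmuls (scores and mixing).
    return 2 * (4 * L * d * d) + 2 * (2 * L * L * d)

def ffn_flops(L, d, ratio=4):
    # Two dense layers of the feed-forward block.
    return 2 * (2 * L * d * ratio * d)

d = 1024  # BERT-Large-like hidden size
for L in (128, 512, 1024, 4096):
    ratio = attention_routing_flops(L, d) / ffn_flops(L, d)
    print(f"L={L:5d}  attention-routing / feed-forward FLOP ratio = {ratio:.2f}")
# The quadratic L x L terms overtake the feed-forward cost as L grows into the
# thousands, whereas SSM convolution routing stays near O(L log L) and does not.
```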
In practice, efficiency depends on hardware and implementation. Figure 7 shows an empirical comparison of two versions of BERT - HuggingFace BERT (Wolf et al., 2020) and BERT with FlashAttention (Dao et al., 2022a) - with BiGS equipped with FlashConv (Dao et al., 2022c). FlashAttention is a highly optimized FP16 implementation of attention, while FlashConv uses FP32 internally for the long-range convolution. These models were tested under identical conditions on a single NVIDIA RTX A6000 GPU for one forward pass of the large model. The results show that BiGS outperforms basic attention, and outperforms the highly optimized FlashAttention once the sequence length passes 3k tokens. When comparing to a model without any routing, we can see that the efficiency bottleneck of BiGS lies in the dense layers, while the SSM adds relatively little overhead, even past 8k tokens.

Task Analysis: Syntactic Properties
While the average GLUE results are similar, BiGS underperforms on some tasks and overperforms on syntactic tasks such as CoLA (Warstadt et al., 2019) (Appendix Figures 9 and 10). We speculate that these results indicate that SSM routing may have different inductive biases than attention. We follow Goldberg (2019) in adapting two preliminary syntactic tests for masked language modeling. Linzen et al. (2016) test a model's ability to distinguish agreement in the presence of spurious intervening "agreement attractors". For example, the sentence "Yet the ratio of men who survive to the women and children who survive [is] not clear in this story" has three attractors for the masked word [is]. Figure 8 shows that BiGS consistently outperforms BERT as the number of attractors grows. Marvin and Linzen (2018) develop pairs of manually constructed examples targeting various syntactic phenomena and difficulties. Given a pair of examples from these stimuli, "No students have ever lived here" and "Most students have ever lived here", we feed an adapted version, "[MASK] students have ever lived here", into a model and compare the predicted scores at the masked position for "No" and "Most". Results are reported in Table 5 and again show that the SSM outperforms BERT on several agreement phenomena. While more experiments are needed, it is possible that BiGS leads to an inductive bias toward a more stack-like representation, since it cannot rely only on dynamic matching.

Annotated CoLA
The CoLA corpus, as described by Warstadt et al. (2019), is a vital task within the GLUE benchmark (Wang et al., 2018) for evaluating linguistic acceptability judgments. This corpus has been annotated with 13 different syntactic phenomena in order to more precisely quantify the linguistic knowledge of pretrained language models (Warstadt and Bowman, 2019). We use these annotated instances to conduct a detailed analysis of the mistakes made by the BiGS and BERT models, breaking down their errors to understand where they struggle with linguistic knowledge. Results are shown in Figure 9. We find that in 9 out of the 13 categories of syntactic phenomena, BiGS performs better than BERT, and significantly so in two domains. We hypothesize that the inductive biases BiGS acquires during training help it better capture these syntactic phenomena, leading to its improved performance.
We break down the Matthews correlation coefficient (MCC) of the BiGS and BERT models with respect to sentence length in Figure 10. BiGS outperforms BERT on both short and long text.
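A small sketch of this length-binned analysis, assuming whitespace-tokenized lengths and placeholder bin edges:

```python
# MCC computed per sentence-length bin; predictions and labels are placeholders.
import numpy as np
from sklearn.metrics import matthews_corrcoef

def mcc_by_length(sentences, labels, preds, edges=(0, 10, 20, 40, 1000)):
    lengths = np.array([len(s.split()) for s in sentences])
    labels, preds = np.array(labels), np.array(preds)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (lengths >= lo) & (lengths < hi)
        if mask.sum() > 1:  # need at least two examples for a meaningful MCC
            mcc = matthews_corrcoef(labels[mask], preds[mask])
            print(f"len [{lo:4d},{hi:4d}): n={mask.sum():4d}  MCC={mcc:.3f}")
```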

Conclusion
We propose BiGS as a model for pretraining without attention. BiGS makes use of SSM-based routing and multiplicative gating. Results show that SSMs alone perform poorly in a stacked architecture, but gating helps them generalize. As far as we are aware, this architecture is the first to replicate BERT results without attention.
This work opens up many interesting questions. We experimented with adapting to longer text, but SSM-based models could be pretrained entirely on much longer sequences. Combining SSMs with reductions in feed-forward costs could give further efficiency gains. Finally, we took first steps in exploring the syntactic properties of SSMs, but further probing is needed into how their internal representations lead to these properties.

Limitations
While SSMs are a promising technology for pretraining, they are not yet a full replacement for attention. One limitation is that this work only considers an encoder model and not an encoder-decoder setup. This makes it challenging to compare to BART and LED in some longer-range evaluations. For example, in our preliminary studies applying BiGS to long-range question answering on WikiQA (Yang et al., 2015) and TriviaQA (Joshi et al., 2017), we did not see direct benefits of SSMs in an encoder-only setting. Others have experimented with decoder SSM models, but it is not clear how cross-attention should work with these models. This work also considers SSMs only for bidirectional pretraining, and not autoregressive modeling. Therefore, some benefits of SSMs are less apparent, such as efficient RNN-style generation.

Ethical Considerations
Our models are trained using a corpus consisting of existing collections of text from Wikipedia and books. Recent research has uncovered potential societal biases embedded within many established corpora. While it is beyond the scope of this paper to delve into these biases in depth, we acknowledge the potential risk that our pretrained models may inherit them. In light of this, we are interested in exploring, as future work, whether previous research on language bias detection can be applied to BiGS. Additionally, this paper has focused solely on an English corpus, and it would be interesting to investigate how BiGS can contribute to multilingual language modeling in the future.

Figure 3 :
Figure 3: Complete SSM routing learned in BiGS. Shows the forward and backward kernels K at each layer (0-22). Values indicate the absolute value of the contribution of each relative position (−10, ..., 10), cropped from the full 2 × 128. Min-max scaling of absolute values is used for visual normalization.

Figure 5 :
Figure 5: Role of gating in downstream accuracy. Compares MNLI accuracy with respect to MLM loss. BERT values from Devlin et al. (2018). The gated SSM shows similar pretraining transfer to BERT, whereas the stacked SSM does not.

Figure 6 :
Figure 6: Role of gating in generalization. Compares accuracy on QNLI by binned length. Gated models generalize to similar-length sequences as BERT (stack / att).

Figure 7 :
Figure 7: Efficiency analysis. Compares several optimized implementations: BiGS with FlashConv, BERT, BERT with FlashAttention in PyTorch 2.0, and a gated architecture with no routing.

Figure 8 :
Figure 8: Syntactic attractors task from Linzen et al. (2016). Tests the ability of models to match word agreement in the presence of intervening attractors.

Figure 9 :
Figure 9: CoLA results in different categories as annotated by Warstadt and Bowman (2019). MCC is used to measure performance.

Figure 10 :
Figure 10: MCC of BiGS and BERT on CoLA broken down by sentence length.

Table 1 :
GLUE results. (Top) Comparison of different architectures and routing in a controlled setting (Izsak et al., 2021); see Figure 2 for details. We fine-tune RTE, MRPC, and STS-B from an MNLI checkpoint following the convention of Izsak et al. (2021). We average results over six runs and report accuracy for MNLI, QNLI, RTE, and SST-2; F1 score for QQP and MRPC; Matthews correlation for CoLA; and Spearman correlation for STS-B. All models are comparable to BERT-Large in size. (Bottom) Reported comparable results for other non-attention-based pretraining models based on CNNs, LSTMs, and FNet (Peters et al., 2018; Tay et al., 2021; Lee-Thorp et al., 2021; Wang et al., 2018). BERT¹ represents the official BERT result (Devlin et al., 2018), and BERT² represents the result using an MNLI checkpoint for other NLI tasks (Izsak et al., 2021). We use − to denote results not reported by previous research.

Table 3 :
SCROLLS encoder test set results. Baseline models are both encoder-decoder models, one based on Longformer (LED) (Beltagy et al., 2020) and the other on BART (Lewis et al., 2019). Inputs are truncated at length.

Table 5 :
Targeted syntactic evaluation from Marvin and Linzen (2018). Numbers for LSTM models are taken from Goldberg (2019).