HyperMixer: An MLP-based Low Cost Alternative to Transformers

Transformer-based architectures are the model of choice for natural language understanding, but they come at a significant cost, as they have quadratic complexity in the input length, require a lot of training data, and can be difficult to tune. In the pursuit of lower costs, we investigate simple MLP-based architectures. We find that existing architectures such as MLPMixer, which achieves token mixing through a static MLP applied to each feature independently, are too detached from the inductive biases required for natural language understanding. In this paper, we propose a simple variant, HyperMixer, which forms the token mixing MLP dynamically using hypernetworks. Empirically, we demonstrate that our model performs better than alternative MLP-based models, and on par with Transformers. In contrast to Transformers, HyperMixer achieves these results at substantially lower costs in terms of processing time, training data, and hyperparameter tuning.


Introduction
Attention-based architectures, such as the Transformer (Vaswani et al., 2017), have accelerated progress on many natural language understanding tasks. Part of their success is the result of a training scheme that parallelizes over the input length. This improves training times and allows for larger volumes of data, which makes these models amenable to pretraining (Radford et al., 2018; Devlin et al., 2019). Therefore, many current state-of-the-art models are fine-tuned extensions of large pretrained Transformers (Bommasani et al., 2021).
However, these models come at a significant computational cost. They require considerable resources for pretraining and fine-tuning, which induces high energy consumption (Strubell et al., 2019) and limits access to research (Bommasani et al., 2021). Consequently, Schwartz et al. (2020) argue the need for "Green AI". They propose to evaluate the cost of a result R as

Cost(R) ∝ E · D · H,

where E is the computational cost of processing a single example, measured in floating point operations (FPOs), D is the dataset size, and H is the number of hyperparameter configurations required during tuning.
To achieve a cost reduction, this paper proposes a simpler alternative to Transformers. We take inspiration from the computer vision community, which has recently seen a surge of research on Multi-Layer Perceptrons (MLPs). The most prominent example is MLPMixer (Tolstikhin et al., 2021), a simple architecture based on two MLPs: one for token mixing and one for feature mixing. However, the token mixing MLP learns a fixed-size set of position-specific mappings, arguably making MLPMixer's architecture too detached from the inductive biases needed for natural language understanding, in contrast to Transformers (Henderson, 2020).
In this paper, we propose a simple variant, HyperMixer (Figure 1), which creates a token mixing MLP dynamically using hypernetworks (Ha et al., 2016). This variant is more appropriate, as it learns to generate a variable-size set of mappings in a position-invariant way, similar to the attention mechanism in Transformers (Vaswani et al., 2017). In contrast to the Transformer's quadratic complexity, HyperMixer's complexity is linear in the input length. This makes it a competitive alternative for training on longer inputs.
Empirically, we demonstrate that HyperMixer works substantially better on natural language understanding tasks than the original MLPMixer and related alternatives. In comparison to Transformers, HyperMixer achieves competitive or improved results at a substantially lower cost Cost(R) ∝ E · D · H: improved inference speeds (E), especially for long inputs; favorable performance in the low-resource regime (D); and efficient hyperparameter tuning (H). We attribute HyperMixer's success to its ability to approximate an attention-like function. Further experiments on a synthetic task demonstrate that HyperMixer indeed learns to attend to tokens in a similar pattern to the attention mechanism.

Figure 1: For token mixing, MLPMixer uses an MLP with a fixed size, maximum input length N, and position-specific weights. In contrast, HyperMixer generates an appropriately sized MLP based on the variable size of the input in a position-invariant way, similar to the attention mechanism. When attention is used for token mixing, the whole layer is equivalent to a Transformer encoder layer.
In summary, our contributions are: i) HyperMixer, a simple MLP-based model that forms its token mixing MLP dynamically using hypernetworks; ii) empirical evidence that HyperMixer outperforms alternative MLP-based models and performs on par with Transformers; and iii) an analysis showing that it does so at substantially lower cost in terms of processing time, training data, and hyperparameter tuning.

Method

Inductive Biases in NLP Models
In machine learning, the inductive biases of a model reflect implicit modeling assumptions, which are key to facilitating learning and improving generalization on specific tasks. In NLP, well-known models with strong inductive biases include recurrent neural networks (Elman, 1990), which assume the input to be a sequence, and recursive neural networks (Socher et al., 2013), which assume a tree structure. While both these inductive biases are reasonable, empirically, Transformers have been more successful in recent years.

Furthermore, we reiterate the arguments of Henderson (2020) for inductive biases in language and apply them to our model design. Henderson (2020) attributes the Transformer's success to two concepts: variable binding and systematicity. Variable binding refers to the model's ability to represent multiple entities at once. This is arguably challenging in single-vector representations such as recurrent neural networks. Transformers, however, represent each token with its own vector, which accounts for variable binding, as each token can be interpreted as an entity. Systematicity refers to the model's ability to learn generalizable rules that reflect the structural relationship between entities (Fodor and Pylyshyn, 1988). Transformers achieve systematicity through the attention mechanism, a learnable set of functions that determines the interaction between entities by matching query representations to key representations (as shown in Figure 1). The mechanism modulates, for every position in the sequence, how to functionally process any other position. Moreover, these function parameters are learnable and shared across all entities.

MLPMixer
A general layer of MLPMixer is shown in Figure 1. Similarly to Transformers, each token is represented as a vector of features, which undergo (non-linear) transformations in multiple layers. MLPMixer employs two MLPs at each layer, one for feature mixing and one for token mixing. The feature mixing component is applied to each token vector independently, which models the interactions between features. The Token Mixing MLP (TM-MLP) is applied to each feature independently (i.e. its vector of values across tokens), which models the interactions between spatial locations or positions. This could be interpreted as a global attention mechanism which is static and position-modulated. Practically, this is achieved by transposing the dimension representing the features and the dimension representing the positions.
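As a concrete illustration, the transpose trick above can be sketched in a few lines of numpy. All names, sizes, and the GELU approximation here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU non-linearity (Hendrycks and Gimpel, 2016)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlpmixer_token_mixing(X, W1, W2):
    """Static TM-MLP of MLPMixer (a sketch).

    X: (N, d) input of N tokens with d features; W1, W2: (N, d') learned,
    position-specific weights, so N must be fixed in advance.
    The MLP acts on each feature column of X independently, which is
    equivalent to transposing the token and feature dimensions first.
    """
    return W2 @ gelu(W1.T @ X)  # (N, d)

rng = np.random.default_rng(0)
N, d, d_hid = 6, 8, 16
X = rng.normal(size=(N, d))
W1, W2 = rng.normal(size=(N, d_hid)), rng.normal(size=(N, d_hid))
out = mlpmixer_token_mixing(X, W1, W2)
```

Note that each feature column of the output depends only on the same column of the input, while all token positions are mixed together.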
Each vector x_i^T ∈ R^N, representing feature i ≤ d of some input of fixed length N, is input into TM-MLP, which has the following form:

TM-MLP(x) = W_2 σ(W_1^T x),    (1)

where W_1, W_2 ∈ R^(N×d') and σ represents the GELU non-linearity (Hendrycks and Gimpel, 2016).

HyperMixer

HyperMixer includes systematicity in the MLPMixer architecture by introducing a novel token mixing mechanism, HyperMixing, which can be regarded as a drop-in replacement for attention. For ease of understanding, we provide pseudo-code in Algorithm 1. While the queries, keys, and values in HyperMixing need not be the same, we will assume they are identical in the following formulation. HyperMixing relies on hypernetworks, which generate the weights W_1, W_2 of TM-MLP (Equation 1) dynamically as a function of the input. Let x_j ∈ R^d, j ≤ N, where N is the (variable) length of the input, represent token j (i.e., query, key, and value). W_1 and W_2 are generated by parameterized functions h_1, h_2 : R^(N×d) → R^(N×d'). Theoretically, h_1 and h_2 could be any function, including sophisticated networks that consider non-linear interactions between tokens, such as the attention mechanism. However, this would defeat the purpose of our model, which is simplicity. Therefore, we choose to generate the rows of the weight matrices from each token independently via another MLP. Concretely, a hypernetwork function can be defined as

h_k(x)_j = MLP_k(x_j + p_j), k ∈ {1, 2},    (2)

where MLP_1 and MLP_2 are themselves multi-layer perceptrons with GELU non-linearity, and p_j ∈ R^d is a vector that can encode additional information such as the position via absolute position embeddings (Vaswani et al., 2017). Intuitively, for each token x_j, h_1 decides which information to send to the hidden layer of TM-MLP, where the information from all tokens is mixed, and h_2 decides for each token how to extract information from the hidden layer. Note that, even though h_1 and h_2 only consider one token at a time, non-linear interactions between tokens are still modeled through the hidden layer of TM-MLP.
Finally, layer normalization (Ba et al., 2016) can be applied to the output of TM-MLP. We found this helpful to facilitate training with a wide variety of Transformer layouts (Appendix F).
Tying h_1 and h_2 In order to reduce the number of parameters and operations in the model, and thereby the complexity, we found it useful to tie h_1 and h_2 by setting W_2 = W_1.
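Putting the pieces together, the tied variant (W_2 = W_1, no layer normalization) can be sketched in numpy. The one-hidden-layer hypernetwork and all array sizes below are illustrative assumptions:

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU non-linearity
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def hypermixing(X, P, Wa, Wb):
    """Tied HyperMixing token mixer (a sketch, W2 = W1).

    X: (N, d) token embeddings for a *variable* number of tokens N;
    P: (N, d) optional position embeddings (zeros => position invariance);
    Wa: (d, h), Wb: (h, d_hid): weights of the hypernetwork MLP, which maps
    each token independently to one row of W1 in R^{N x d_hid}.
    """
    W1 = gelu((X + P) @ Wa) @ Wb        # (N, d_hid): one row per token
    hidden = gelu(W1.T @ X)             # (d_hid, d): mixes all tokens
    return W1 @ hidden                  # (N, d): per-token outputs

rng = np.random.default_rng(0)
N, d, h, d_hid = 5, 8, 32, 16
X = rng.normal(size=(N, d))
Wa, Wb = rng.normal(size=(d, h)), rng.normal(size=(h, d_hid))
out = hypermixing(X, np.zeros((N, d)), Wa, Wb)
```

Because the hypernetwork treats tokens independently, the layer without position embeddings is permutation-equivariant, and it accepts any input length N without retraining, unlike the static TM-MLP.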
Considerations for NLP In comparison to the MLPMixer defined in Section 2.2, the use of hypernetworks overcomes two challenges. Firstly, the input no longer has to be of fixed dimensionality. The hypernetwork generates a token mixing MLP of appropriate dimension as a function of the input. Secondly, the hypernetwork models the interaction between tokens with shared weights across all positions in the input. Hence, systematicity is ensured.

Related Work
Research on all-MLP models like MLPMixer (Tolstikhin et al., 2021) is widespread in the computer vision community (Tu et al., 2022; Yu et al., 2022; Wang et al., 2022, among many others). However, they lack some desirable inductive biases for NLP, which we discuss at length in Appendix A.2. Specifically, in contrast to HyperMixer, none of the previously proposed methods simultaneously provide i) position invariance, which is important for generalization, ii) adaptive size for variable-length inputs, iii) a global receptive field, which allows interactions to not be limited to small token neighborhoods, iv) learnability, allowing for universal applicability to various tasks, and v) dynamicity, which means that token mixing is a function of the input. Consequently, only a few works have used MLP-based models as their backbone in NLP tasks. gMLP (Liu et al., 2021) serves as one of our baselines, and pNLP-Mixer (Fusco et al., 2022) employs a standard MLPMixer on top of a novel token embedding method.
Apart from all-MLP models, there is an abundance of research on efficient alternatives to standard attention layers (Katharopoulos et al., 2020; Bello, 2021, et cetera). While they don't qualify as all-MLP models, they have close connections to our work (see Appendix E) and aim at lowering the cost of AI, albeit on fewer dimensions than our work (Appendix A.1). We employ FNet (Lee-Thorp et al., 2021) and Linear Transformers (Katharopoulos et al., 2020) as representative baselines from this line of work.

Experiments
Our experiments are designed to test the following three hypotheses. H1 (Section 4.3): Since HyperMixer reflects more of the inductive biases adequate for NLP, our hypothesis is that HyperMixer performs better at NLP tasks than MLPMixer and similar MLP-based alternatives, specifically at tasks that require modeling the interactions between tokens. H2: Since HyperMixer has similar inductive biases to Transformers but is considerably simpler, both conceptually and in terms of computational complexity, it can be seen as a low cost alternative to Transformers, reducing the cost in terms of single example processing time (Section 4.4), required dataset size (Section 4.5), and hyperparameter tuning (Section 4.6). H3 (Section 4.7): Due to its inductive biases mirroring those of Transformers, HyperMixer also learns similar patterns to the attention mechanism.

Datasets
We evaluate on four sentence-pair classification tasks and one single-sentence classification task. The sentence-pair tasks are QQP (Iyer et al., 2017), QNLI (Rajpurkar et al., 2016), MNLI (Williams et al., 2018) and SNLI (Bowman et al., 2015). For uniformity, datasets are formatted as in the GLUE benchmark (Wang et al., 2018). We choose these tasks for two properties: firstly, they have large training datasets (Table 2, appendix) enabling reasonable performances without pretraining; secondly, solving these tasks requires good modeling of the interactions between tokens from different sentences, which is the main focus of this paper. As a control, we experiment on the single-input dataset SST2 (Socher et al., 2013), which is a sentiment classification task. Many examples in this dataset can be solved by identifying key sentiment words, rather than modeling the token interaction.

Baselines
The following baselines can be categorized into MLP-based (to support H1) and not MLP-based (e.g., Transformers, to support H2). Note that our study is about the design of the token mixing module. Therefore, we only compare to models that fit into the general framework displayed in Figure 1, where there is a feature mixing module and a token mixing module for textual inputs. As a result, models such as RNNs are excluded. To enable a controlled experiment, we use the same feature mixing module in all models; the models only differ in their token mixing module.

MLP-based The conceptually closest baseline is
MLPMixer (Tolstikhin et al., 2021), which combines both token and feature mixing using fixed-dimensional MLPs, as described in Section 2.2. Concurrently, Liu et al. (2021) proposed gMLP, in which token mixing is achieved through a weighted summation of all other inputs, similar to the attention mechanism. However, rather than computing the weights as a function of the inputs as in attention, in gMLP the weights are fixed learnable parameters. Additionally, linear gating initialized close to one is introduced to facilitate training. The original gMLP method does not employ feature mixing modules, as its token mixing module is capable of modeling feature interactions as well within a single gMLP block. However, for comparability, we inject gMLP blocks as token mixing modules in our general architecture and keep the feature mixing modules as well.
Non MLP-based Transformers (Vaswani et al., 2017) are used in the current state of the art in virtually all NLP tasks. Their key component is the softmax-based self-attention module, which we use for token mixing. Linear Transformer (Katharopoulos et al., 2020) replaces softmax attention with a feature-map-based dot-product attention.
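For reference, the Linear Transformer baseline replaces the softmax with a kernel feature map φ(x) = elu(x) + 1, so attention can be computed in O(N) rather than O(N²). A single-head numpy sketch (variable names and sizes are our illustrative assumptions, not the original implementation):

```python
import numpy as np

def linear_attention(Q, K, V):
    """Feature-map dot-product attention (Katharopoulos et al., 2020), a sketch.

    Q, K: (N, d) queries/keys, V: (N, d_v) values. With phi(x) = elu(x) + 1,
    out_i = sum_j phi(q_i)·phi(k_j) v_j / sum_j phi(q_i)·phi(k_j),
    computed via a (d, d_v) summary instead of an (N, N) attention matrix.
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                    # (d, d_v): accumulated key-value summary
    Z = Qp @ Kp.sum(axis=0)          # (N,): normalization terms
    return (Qp @ KV) / Z[:, None]    # (N, d_v)

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
out = linear_attention(Q, K, V)
```

As with softmax attention, each output row is a convex combination of the value rows, since all φ outputs are positive.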
Finally, FNet (Lee-Thorp et al., 2021) replaces the self-attention part of Transformers with a fixed, non-learnable set of Fourier transforms for token mixing.

Performance
We first compare the performance of HyperMixer to that of our baselines. Thereafter, we further explore the model's benefits with respect to its cost.
For comparability, we adjust the size of the token mixing components such that all models have the same number of parameters (11M). FNet is an exception since it has no learnable parameters in its token mixing component. We tune the learning rate of each model via grid-search, and report the performance of the best configuration. Further experimental details on all experiments can be found in Appendix B.
Results Validation and test set results are shown in Table 1. On both the test and the validation set, HyperMixer performs best among MLP-based models on all datasets, although for SST the difference on the validation set is smaller than one standard deviation. MLPMixer generally achieves good performances, outperforming Transformers on two datasets.
Comparing to non-MLP-based methods, HyperMixer also outperforms vanilla Transformers on all datasets. The differences are generally small (≤ 2 points), except on QNLI, where the difference is 3.9 points. We suspect that this discrepancy is due to the relatively small training set of QNLI. We investigate the low-resource behavior of Transformers in comparison to HyperMixer in Section 4.5. FNet performs substantially worse than the other methods, particularly on SNLI and QQP. Linear Transformers achieve excellent performance on MNLI and SNLI, but perform poorly on QNLI and QQP. In Appendix C.2, we discuss ablations such as the untied HyperMixer.

Time per Example
In order to assess the efficiency of our model, we measure the wall-clock time of processing a single input (repeated 1,000 times) through the token mixing stages of HyperMixer and Transformer, respectively. As Schwartz et al. (2020) point out, wall-clock time has the downside of being dependent on the specific implementation; they therefore recommend reporting the number of floating point operations (FPOs) required by one forward pass. In Figure 2, we show wall-clock time and theoretical FPOs as a function of the input length N. For short input sequences, the number of FPOs is dominated by the size of the hidden layer and is hence slightly lower for Transformers than for HyperMixer. However, in practical terms, we observe that HyperMixer is still faster than Transformers. At longer input sequences, the size of N starts to dominate the total complexity of Transformers, so that it becomes considerably slower than HyperMixer.
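The asymptotic argument can be made concrete with rough, back-of-the-envelope operation counts. The constants and sizes below are our own assumptions for illustration, not the measured FPOs of Figure 2:

```python
def attention_mixing_flops(n, d):
    # QK^T and AV each take about 2*n*n*d multiply-adds; the input/output
    # projections (~8*n*d*d) are omitted, as they are linear in n anyway
    return 2 * (2 * n * n * d)

def hypermixing_flops(n, d, d_hid):
    # hypernetwork: one (assumed) d -> d_hid layer per token, ~2*n*d*d_hid;
    # TM-MLP: two matrix products of ~2*n*d_hid*d each; every term linear in n
    return 2 * n * d * d_hid + 2 * (2 * n * d_hid * d)

d, d_hid = 256, 512  # assumed model sizes
# short inputs: the quadratic term is still small, attention is cheaper;
# long inputs: the n^2 term dominates and HyperMixer wins
short_n, long_n = 128, 4096
```

With these (assumed) sizes, attention needs fewer operations at N = 128 but roughly five times more at N = 4096, mirroring the crossover behavior reported for Figure 2.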

Low Resource Performance
Like MLPMixer, HyperMixer is a conceptually simple architecture, as it only applies multi-layer perceptrons at its core. Simpler architectures often make for better performance on smaller scale datasets. We investigate this by varying the number of examples used for training on the three large datasets MNLI, SNLI, and QQP. For these experiments, we use the best performing learning rate found in the grid search from Section 4.3. In Figure 3, we plot the relative performance change of HyperMixer compared to Transformers as a function of subsample size. On all datasets, the relative improvement of HyperMixer over Transformers is larger when training with 10% of the dataset than with the full dataset. While the effect is small on QQP, it is particularly large on SNLI and MNLI, where HyperMixer performs 12-14% better with 10% of the data, while the relative improvement with the full dataset is less than 2%.

Ease of Hyperparameter Tuning
MLP-based token mixing has the advantage that it is conceptually simpler than self-attention, and that it is well-known how to facilitate its training via mechanisms such as skip-connections and layer normalization. Both these aspects suggest that it might be easier to find hyperparameter configurations that yield good performance. In these experiments, we compare HyperMixer (with tied hypernetworks) to Transformers in this regard. As recommended by Schwartz et al. (2020), we perform a random search to tune hyperparameters and compute the expected validation performance (Dodge et al., 2019, 2021). Specifically, we tune the learning rate, whose logarithm is drawn from U(−8, −1), and the dropout probability, drawn from U(0, 0.5), for 20 trials.
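The expected validation performance of Dodge et al. (2019) can be estimated from the n observed random-search trials without any extra training runs. A sketch of one common form of the estimator (variable names and the example scores are ours):

```python
import numpy as np

def expected_max_performance(scores, k):
    """Expected best validation score among k random-search trials.

    Estimated from n observed trial scores: with v sorted ascending,
    P(best of k <= v_i) = (i/n)^k, so
    E[best of k] = sum_i v_i * ((i/n)^k - ((i-1)/n)^k).
    """
    v = np.sort(np.asarray(scores, dtype=float))
    n = len(v)
    i = np.arange(1, n + 1)
    w = (i / n) ** k - ((i - 1) / n) ** k   # probability v_i is the best of k
    return float(np.sum(v * w))

trials = [0.62, 0.70, 0.66, 0.73, 0.68]    # hypothetical validation accuracies
curve = [expected_max_performance(trials, k) for k in (1, 5, 20)]
```

Plotting such a curve against the tuning budget k is the kind of comparison underlying Figure 4: a model whose curve rises faster at small k is easier to tune.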

Results
In Figure 4, we show the relative expected validation performance, i.e., the relative performance change of HyperMixer compared to Transformer, for all five datasets. With the notable exception of QNLI, the relative improvement of HyperMixer is higher at smaller budgets than at larger budgets on all datasets. The effect is particularly strong on SNLI, where HyperMixer is 6.5% better at small tuning budgets, but less than 2% better at high budgets. These results indicate that HyperMixer is substantially easier to tune than Transformers.

HyperMixer Learns Attention Patterns
We hypothesized that the token mixing layer of HyperMixer offers a mechanism similar to attention. To show this, we consider a toy problem with 1d sequences composed of shape pairs of different heights as described in Fleuret (2019). The target value is the average height in each pair of shapes. An example input is shown in Figure 5a. To solve the task well, for each position, the model must attend to other positions with the same shape.
Models We compare the token mixing layer of HyperMixer to three other models: i) None does not model token interactions; all predictions are thus only made based on local information, so this model should fail. ii) MLPMixer does model token interactions. Still, since its token mixing weights are position-specific, each position has to learn to recognize each shape, which we expect to be difficult, especially with little data. iii) Self-attention can be considered the upper bound, as it models the interaction between every two positions explicitly.
Results Figure 5b shows the mean squared error on the test examples depending on the number of training examples. As expected, None fails on this task. While all other models are able to solve the task with enough training data, MLPMixer is considerably less data-efficient than the other two models, requiring 5-10 times more data to reach the same performance. This is expected, since, in contrast to HyperMixer and self-attention, MLPMixer's token mixing module is not position-invariant. HyperMixer and self-attention reach approximately the same performance when training on 100k examples. However, HyperMixer is more data-efficient than self-attention, which we attribute to its simpler model architecture. We can measure the interaction between two tokens by computing the gradient of an output token with respect to an input token (pseudo-attention). Figures 5d and 5c show the pseudo-attention maps of HyperMixer in comparison to attention. We observe that the pseudo-attention weights of HyperMixer and attention are similar. This indicates that HyperMixer indeed learns an attention-like function. In contrast, we find these patterns to be weaker in MLPMixer (Figure 6, appendix).
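Such a pseudo-attention map can be approximated for any token mixing function by finite differences. This is only a sketch: the paper computes gradients, and the Frobenius-norm aggregation over feature channels is our assumption:

```python
import numpy as np

def pseudo_attention(f, X, eps=1e-5):
    """Gradient-based token interaction map (a sketch, via finite differences).

    A[i, j] approximates the Frobenius norm of the Jacobian of output
    token i with respect to input token j, for f mapping (N, d) -> (N, d).
    """
    N, d = X.shape
    Y = f(X)
    A2 = np.zeros((N, N))
    for j in range(N):
        for c in range(d):
            Xp = X.copy()
            Xp[j, c] += eps              # perturb one channel of token j
            G = (f(Xp) - Y) / eps        # sensitivity of every output token
            A2[:, j] += np.sum(G**2, axis=1)
    return np.sqrt(A2)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
A = pseudo_attention(lambda Z: Z, X)     # identity map: tokens don't interact
```

For the identity map, only the diagonal is non-zero, since each output token depends only on its own input; a trained token mixer produces off-diagonal structure wherever tokens interact.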

Discussion
In the following, we first discuss the merits of our proposed model, which are the core contributions of our paper. We then discuss the scope of our analysis.

Impact
Best all-MLP model HyperMixer was designed as an MLP-based architecture with similar inductive biases as Transformers, which are beneficial for natural language understanding. Our hypothesis (H1) is that this leads to improvements over other MLP-based methods. Our experimental results support this hypothesis, as we find HyperMixer to outperform all MLP-based baselines on all datasets (Section 4.3).
Low cost model The main motivation for an MLP-based architecture is the efficiency benefits induced by its simplicity. Therefore, we hypothesized (H2) that HyperMixer would reduce the cost Cost(R) ∝ E · D · H to obtain an AI result R. This hypothesis is supported by our experiments.
While HyperMixer yields results that are on par with the Transformer's, it reduces all three cost factors: i) The cost of processing a single example (E) is lower, particularly for long inputs, due to its linear complexity compared to the quadratic complexity of self-attention (Section 4.4).
ii) The number of required training examples (D) is reduced, as HyperMixer's relative performance improvement is larger in the low-resource scenario (Section 4.5). iii) HyperMixer requires less hyperparameter tuning than Transformers to reach good results, which is demonstrated by HyperMixer's higher expected relative improvements at low tuning budgets (Section 4.6).
Attention-like model Finally, our experiments on a synthetic task indicate that HyperMixer can learn very similar attention patterns to the self-attention mechanism in Transformers (Section 4.7), supporting hypothesis H3. While MLPMixer can also learn similar patterns given enough training data, we believe that it is the introduction of adequate biases that allows HyperMixer to learn these patterns efficiently. These biases were chosen based on an analysis of the Transformer's success by Henderson (2020). HyperMixer's own success hence supports that analysis.
In summary, in our study, HyperMixer is the best-performing MLP-based architecture, and shows comparable performance and behavior to self-attention at substantially lower cost. HyperMixer can thus be considered a low cost alternative to Transformers.

Scope
Small resource scenario It is important to note that our study is limited to the small resource scenario: Our models are small, not pretrained on large general-purpose corpora, and trained on datasets with fewer than 1 million examples. It is unclear if our results will also hold on larger scale. For example, while gMLP and FNet perform poorly in the low-resource scenario as demonstrated in our experiments, both models are able to narrow the gap to Transformer-based models as the resources for pretraining increase (Liu et al., 2021;Lee-Thorp et al., 2021). We hypothesize that with enough resources, these models are able to overcome their shortcomings in terms of inductive biases. However, there is no reason to believe that HyperMixer, being equipped with useful inductive biases, wouldn't perform on par with Transformers in high-resource scenarios while retaining its lower overall cost. Quite the contrary, HyperMixer's linear complexity in sequence length perhaps makes it more appropriate for large-scale pretraining on long contexts than vanilla Transformers.
Versatility One of the most impressive qualities of Transformers is their versatility: Not only are they now the standard architecture for all NLP tasks, but over the years they have also become ubiquitous in a wide range of application domains outside of NLP. Of course, the present study cannot determine whether HyperMixer is as versatile as Transformers. However, subsequent studies have shown that HyperMixer has uses in speech recognition (Mai et al., 2023) and neural combinatorial optimization (Drakulic et al., 2023). Still, some modeling advancements are needed. For example, HyperMixing is not yet applicable to decoder models that make use of causal masking. As decoder-only language models have become widely studied, this constitutes promising future work.

Conclusion
While large pretrained Transformer language models have led to impressive progress, they require so many resources that many research labs are excluded from participation, leading to calls for Green AI. We have proposed an MLP-based method, HyperMixer, that, in contrast to previous MLP-based methods, is equipped with the same inductive biases that made Transformers so successful for natural language understanding. While it performs on par with Transformers, it incurs substantially lower cost in terms of processing time, training data, and hyperparameter tuning. Hence, we believe our study demonstrates the merits of MLP-based models for natural language understanding as an alternative to attention-based models, and we hope that the community pursues this direction further. Avenues for future work include large-scale pretraining, evaluation on a wider range of tasks and domains, and the model's adaptation to text generation.

Limitations
Many limitations of our study are already discussed in Section 5.2; however, we repeat and add to them explicitly here.
Small resource scenario Our study investigates MLP-based architectures for text classification tasks and finds competitive performance with vanilla Transformers while having lower cost in terms of the Green AI equation. However, the scope of our findings is naturally limited to the testing scenario, which is low-resource: Our models are relatively small, not pretrained on large general-purpose corpora, and trained on datasets with fewer than 1 million examples. We may not say with certainty that our results will also hold on larger scale. For the sake of hypothesis-driven research we consider it more valuable to run many controlled small-scale experiments rather than few large-scale experiments. Nonetheless, scaling up should certainly be part of future research directions, as this is essential for optimal task performance.
Limitation to English pairwise sentence classification tasks Since token mixing is the independent variable in our study, we put our main focus on English sentence-pair classification tasks with textual input only, which we presume (and provide some evidence for) to be most useful to assess differences between token mixing models. Of course, vanilla Transformers are very flexible in the sense that, over the course of many studies, they have been shown to be very effective for a wide range of tasks, languages and data modalities. Whether or not the proposed HyperMixer model possesses similar flexibility cannot be answered in this study. The HyperMixer encoder arguably possesses similar inductive biases as Transformers. We thus expect it to be straightforward to apply to tasks that are also solved well by Transformer encoders (e.g., span classification). For tasks such as language modeling, which involve a Transformer decoder, significant modeling advancements are required to obtain a HyperMixer equivalent. We consider this a very promising direction for future work.
Limitation to MLP-based baselines Similar to a trend in the computer vision community, our study investigates the suitability of MLP-based architectures for NLP. Due to their conceptual simplicity, these models promise to be easier to train, potentially leading to reduced Green AI costs. To this end, we compare our proposed HyperMixer model to a range of other MLP-based models, and to Transformers. Apart from FNet and Linear Transformers, which are efficient Transformer alternatives, we do not attempt an exhaustive comparison to non-MLP-based efficient NLP models. Hence, the scope of our claims does not extend to all efficient Transformer models. However, these models are of course very relevant to this study, as they target one of the factors of the Green AI cost (single forward pass complexity). Therefore, we regard a comprehensive comparison as valuable future work.

A.1 Green AI

Apart from being problematic environmentally, it has been argued that the monetary cost of pretraining is too high to be widely accessible for most researchers. In a research community that focuses on task performance, low-resourced researchers would be disadvantaged. Therefore, metrics that take into account the cost of reaching a result are important to consider (Schwartz et al., 2020). The metric Cost(R) ∝ E · D · H is proposed and discussed in Section 1. However, reporting a single metric Cost(R) is often ambiguous. Therefore, in our experiments, we consider the factors E, D, and H separately.
To measure the computational cost per example E, Schwartz et al. (2020) propose counting the floating point operations (FPOs) required. In our experiments, we adopt this metric and additionally report wall-clock time for practical reference. The component D evaluates the quantity of training data needed to reach a given accuracy, or the performance of a model in a low-resource scenario (Hedderich et al., 2020; Chen et al., 2021). Finally, the component H measures the cost associated with hyperparameter tuning. This is reported using the expected validation performance introduced by Dodge et al. (2019, 2021), which computes the validation performance one would obtain in expectation after k hyperparameter trials of random search (Bergstra and Bengio, 2012).
Current literature does not cover all facets of Green AI as formalized by Cost(R). Typically, improving efficiency involves making existing models more accessible, for example through model distillation (Sanh et al., 2019) or adapter modules (Houlsby et al., 2019). Another avenue is reducing computational complexity, for example via prompt-tuning (Schick and Schütze, 2020) or efficient self-attention in Transformers (Child et al., 2019; Beltagy et al., 2020; Katharopoulos et al., 2020, et cetera). The latter approach is closest to our work. However, these methods focus on the processing time of a single example (E) and do not consider the other facets of Green AI. In this paper, we focus on MLP-based approaches, which we argue improve all facets of Green AI due to their simplicity.

A.2 MLP-based Models
The vision domain has seen promising results with purely MLP-based models (Tolstikhin et al., 2021); however, these models lack the inductive biases desired for NLP. Desirable properties for modeling language include: i) position invariance, which is important for generalization; ii) adaptive size for variable-length inputs; iii) a global receptive field, so that interactions are not limited to small token neighborhoods; iv) learnability, allowing universal applicability to various tasks; and v) dynamicity, meaning that the output is conditioned on the input. MLP-based models are typically not used for NLP because incorporating position invariance, adaptive size, and a global receptive field is non-trivial for MLPs.
Several methods try to overcome the lack of size adaptivity by introducing shifting operations and local windows. Yu et al. (2022) and Lian et al. (2022) use spatial shifting to pass information between adjacent tokens through an MLP, and Tang et al. (2021) use a circular shifting operator. However, position invariance is violated, because positional information determines which tokens are included in a neighborhood. The aggregation of local information itself is done via a (relative) position-specific MLP. Global interactions are modeled only through stacking enough layers or through a hierarchical layout (Yu et al., 2022; Guo et al., 2021).
For vision tasks, it can be useful to exploit the fact that 2D images consist of two axes. Tatsunami and Taki (2021) integrate a corresponding inductive bias, and Tu et al. (2022) achieve linear complexity by applying a gMLP (Liu et al., 2021) to only a single axis.
A global receptive field in MLP-based models is achieved through token mixing, a weighted summation of the inputs similar to self-attention, which allows tokens to interact. Liu et al. (2021) propose gMLP, where the mixing weights are determined by a fixed learnable interaction matrix between positions. However, this comes at the cost of violating position invariance, size adaptivity, and dynamicity. DynaMixer (Wang et al., 2022) enables dynamicity by estimating the mixing weights from the concatenation of the inputs via a linear layer. This is efficient due to a dimensionality reduction step, but the concatenation still implies position dependence and fixed-size inputs. Lee-Thorp et al. (2021) propose FNet, which uses static Fourier transforms to model token interactions. This substantially reduces computation cost, although the transform lacks learnability and is position-dependent.
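To make the static token-mixing idea concrete, here is a minimal numpy sketch (illustrative only; it is not the code of any of the cited models, and the sizes are arbitrary). A single N × N matrix mixes information across positions for each feature channel independently; because its size is fixed, it illustrates why such models violate size adaptivity.

```python
import numpy as np

# Toy sketch of gMLP-style static token mixing (not the paper's code).
N, d = 4, 3                          # sequence length, embedding size
rng = np.random.default_rng(0)
X = rng.standard_normal((N, d))      # token embeddings
W_mix = rng.standard_normal((N, N))  # static, position-specific mixing weights
# Each output token is a weighted sum of all input tokens; W_mix is fixed,
# so inputs of a different length cannot be processed without retraining.
Y = W_mix @ X
```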

A.3 Hypernetworks
For computing expected validation performance, we use the public implementation by Dodge et al. (2019).
We run our experiments on single-GPU servers available to us as part of a computation grid, ranging between GeForce GTX Titan X and RTX 3090. Apart from Transformers on SNLI and MNLI, which take about 4 hours on slower GPUs, all experiments finished within 3 hours.
Hyperparameters We provide CSV files detailing all parameters of every run alongside their results in the supplementary material, ensuring reproducibility of our study. Note that the computation environment (e.g., type of GPU) might lead to small differences.

B.2 Peak Performance
To ensure a fair comparison, we aim to compare models with approximately the same number of parameters (≈11 M). All models have 6 layers with token embedding size d = 256 and hidden size d′ = 512. For MLPMixer and gMLP, we set the size of the token mixing modules to N = 250 and N = 100, respectively; these sizes are chosen to match the parameter count of the other models (11 M). We use dropout at the input to each layer with a probability of 0.1. For all models, including the ablations, we first tune the learning rate of Adam (Kingma and Ba, 2014) using a logarithmically spaced grid of 7 values α ∈ {0.001, 0.0005, 0.0002, 0.0001, 0.00005, 0.00002, 0.00001} on the validation set. For our baselines, we then evaluate 10 different seeds and report the mean accuracy and standard deviation on the validation set. On the test set, we only report the results of the model yielding the best validation results, as the GLUE benchmark (Wang et al., 2018) has a hidden test set with limited access. Ablations are evaluated on the validation set with a single seed.

B.3 Time per Example
Due to the lack of reliable software to measure FPOs in PyTorch, we calculate these numbers manually; our process is described in Appendix D. For the measurement of wall-clock time, we measure the time of 1,000 batches through a single layer of each model.

This section gives more detail about how we set up the synthetic example (Fleuret, 2019) for evaluating whether the different models are able to learn an attention-like transformation. The dataset consists of 1D sequences that contain two rectangular and two triangular shapes, each with a height drawn at random. The output sequence has the same shapes in the same positions, but the height of each triangular shape should be the mean of the heights of the two triangular shapes in the input sequence; analogously, the height of each rectangular shape in the output sequence is the mean of the heights of the two rectangular shapes in the input sequence. The model must therefore look across the sequence and compute the mean height of each shape type to succeed at the task. All the models considered for this task have a similar structure: a particular layer (MLPMixer, HyperMixer, or Attention) surrounded by two pairs of 1D-convolutional layers with kernels of size five and symmetric zero-padding of size two, so that the output shape is preserved. We ran an ablation to verify that this layer is necessary by replacing it with another 1D convolutional layer, which corresponds to "None" in Figure 5b.
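A hypothetical re-creation of the data generation described above can be sketched as follows. The exact sequence length, shape width, and height range used by Fleuret (2019) are assumptions here, not the paper's values; only the averaging rule follows the text.

```python
import numpy as np

def make_example(length=100, width=10, seed=0):
    """Hypothetical generator for the 1D shapes task; parameters are
    assumptions.  Targets keep each shape in place but replace its height
    by the mean height of its shape type, as described in the text."""
    rng = np.random.default_rng(seed)
    x = np.zeros(length)
    y = np.zeros(length)
    # Four non-overlapping slots: two rectangles (kind 0), two triangles (kind 1).
    starts = rng.permutation(length // width)[:4] * width
    kinds = np.array([0, 0, 1, 1])
    heights = rng.uniform(1.0, 10.0, size=4)
    target_h = {k: heights[kinds == k].mean() for k in (0, 1)}
    rect = np.ones(width)
    tri = 1.0 - np.abs(np.linspace(-1, 1, width))  # triangular profile
    for s, k, h in zip(starts, kinds, heights):
        profile = rect if k == 0 else tri
        x[s:s + width] = h * profile                # input: random heights
        y[s:s + width] = target_h[k] * profile      # target: per-type mean
    return x, y

x, y = make_example()
```

Because the target replaces each pair of heights by their mean, the total "mass" of input and output sequences is identical, which gives a quick sanity check on the generator.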
Before visualizing the pseudo-attention maps, all models were trained on 25,000 training examples. We use input-gradients (Simonyan et al., 2014) to evaluate whether models can "attend" to the different shapes. This method computes the gradient of the output sequence with respect to the input sequence, giving the corresponding saliency map, which can then be recombined into a pseudo-attention matrix whose i-th column corresponds to the saliency map of the i-th output token. A large value in entry (i, j) of the pseudo-attention matrix means that output token i strongly depends on input token j, so we can compare it to an attention matrix (Figure 6a). Figure 6 shows the pseudo-attention matrices for the different models. The pseudo-attention of the attention model indeed approximates the true attention matrix (Figure 6a), while the model with no special layer cannot attend to the correct part of the sequence, as expected. Finally, the pseudo-attention of the Mixer layer is not as peaked as that of the Attention or HyperMixer layer.
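The input-gradient recipe can be sketched as follows. The toy model is an assumption for illustration (not any of the paper's architectures), and finite differences stand in for autodiff so the example stays framework-free; entry (i, j) of the resulting matrix approximates how strongly output token i depends on input token j.

```python
import numpy as np

def model(x, W1, W2):
    # toy non-linear token mixer, for illustration only
    return W1 @ np.tanh(W2 @ x)

def pseudo_attention(f, x, eps=1e-6):
    """Finite-difference stand-in for input-gradients (Simonyan et al., 2014):
    J[i, j] approximates d y_i / d x_j."""
    y0 = f(x)
    J = np.empty((y0.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (f(xp) - y0) / eps
    return J

rng = np.random.default_rng(0)
N = 6
W1, W2 = rng.standard_normal((N, N)), rng.standard_normal((N, N))
x = rng.standard_normal(N)
A = pseudo_attention(lambda v: model(v, W1, W2), x)
# analytic Jacobian of the toy model, as a sanity check:
J_true = W1 @ np.diag(1.0 - np.tanh(W2 @ x) ** 2) @ W2
```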

C.1 Validation Set Results
In Table 3, we show the best scores on the validation set that we obtained from the grid search (using a fixed seed), alongside the learning rate that yielded that score.
In Section 4.3, we reported the test set results of all models when using the best-performing seed.
In Table 4, we show test set results when using the median seed.

C.2 Ablations
We first describe the ablation models before we discuss their results.
Feature Mixing Only The most simplistic MLP architecture is one that does not use token mixing, i.e., the token mixing module is set to the identity function. The outputs at the last layer are aggregated via average pooling before being fed into the linear classifier. This provides a baseline in which token interactions are not modeled; the architecture thus serves as a control for how important token mixing is in a given task.
Token Mixing Only This ablation is a simplistic architecture that only uses token mixing: a variable-dimension MLP whose weights are generated by a hypernetwork, allowing only token interactions (no feature mixing). This model is included to show that an effective simple model requires both token mixing and feature mixing to model textual inputs efficiently.
Shared Weight-Vector A simple way to obtain a variable-size location-mixing MLP is weight sharing. Concretely, we use a single learnable weight vector w1 ∈ R^d′, which we copy N times to create a weight matrix W1 ∈ R^(N×d′). Analogously, we create W2 from a separate vector w2. Note that this baseline does not support dynamicity, as the weight vector is independent of the inputs. This baseline thus shows the importance of dynamicity in our model.
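The construction above amounts to a one-line tiling operation; a minimal numpy sketch (with an arbitrary d′ and hypothetical helper name):

```python
import numpy as np

# Shared Weight-Vector baseline: one learnable vector, copied N times.
# Size-adaptive (any N) but input-independent, i.e., no dynamicity.
d_hidden = 512
w1 = np.random.default_rng(0).standard_normal(d_hidden)

def make_W1(N, w1=w1):
    # hypothetical helper: tile w1 into a weight matrix of shape (N, d_hidden)
    return np.tile(w1, (N, 1))

W1 = make_W1(7)
```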

Results
Results are shown in Table 5. Untying the hypernetworks in HyperMixer leads to slightly decreased performance on all datasets. We hypothesize that without pretraining, the model cannot benefit from the more capacious token interaction modeling introduced by untying. Nonetheless, the untied model still performs on par with, or slightly better than, vanilla Transformers. While the introduction of MLPMixer and similar models follows a trend towards conceptually simpler models, our ablations show, perhaps unsurprisingly, that simplicity is not better when it leads to discarding information: both the Feature-Mixing only and Location-Mixing only models perform substantially worse than the full HyperMixer model. Moreover, it is not enough to use the same learnable weight vector for all positions (Shared Weight-Vector), indicating the importance of generating the MLP based on the input.
The simplistic Feature-Mixing only model performs poorly on all datasets except SST, where it performs as well as the other models. This indicates that many instances in SST can be solved by looking at individual tokens alone, rather than by modeling their interactions. Figure 6 shows the pseudo-attention of all models (except "None") alongside the true attention weights. First, note that pseudo-attention weights offer a somewhat blurry version of true attention weights, with high weights occurring at positions that correspond to the same shape (compare Figure 6a to 6b). Second, the pseudo-attention weights of HyperMixer and attention (compare Figure 6d to 6b) are similar, indicating that HyperMixer indeed learns an attention-like function. Third, MLPMixer also shows a similar pattern, but the relevant positions have only weak connections (Figure 6c). This matches our finding that MLPMixer requires substantially more training data to learn strong connections.

D Comparison of #FOP
We want to compute the number of floating-point operations needed in self-attention vs. HyperMixing for a single example. Let N be the sequence length, d be the embedding size of each token, and d ′ the hidden dimension.
For simplicity, we assume basic mathematical operators such as exp, tanh, √x, and division to cost one floating-point operation each. Their actual cost is higher and depends on implementation and hardware.

D.1 Basic Building Blocks
We first compute the number of operations in frequently occurring basic building blocks of neural networks.

Matrix Multiplication
Multiplying matrices A ∈ R^(N×d) and B ∈ R^(d×M) takes 2d(NM) operations: each entry of the resulting matrix is a dot product costing about 2d operations (d multiplications and d − 1 additions), and there are NM entries.
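The count above reduces to a one-liner, included here only to make the arithmetic explicit:

```python
# ~2d operations per entry, N*M entries in the product of (N, d) x (d, M).
def matmul_flops(N, d, M):
    return 2 * d * N * M
```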
Linear Layer Passing a single vector of size d through a linear layer without bias of size (d, d ′ ) is the multiplication of a single vector with a matrix, i.e., incurs 2dd ′ operations in total.
GELU GELU is usually approximated as GELU(x) = 0.5x(1 + tanh(√(2/π)(x + cx³))), which amounts to roughly 9 operations per scalar under our assumptions. GELU is computed for each of the d features and each of the N vectors, so the GELU activation layer takes 9dN operations.

Mixing MLP The mixing MLP has input and output size N and hidden size d′, and is applied to each of the d embedding dimensions (i.e., after transposition), incurring d(4d′N + 9d′) operations in total: 2Nd′ for each of the two linear layers (4Nd′ combined), plus 9d′ for the GELU on the hidden layer.
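As a sanity check on the tanh approximation quoted above, the following compares it against the exact (erf-based) GELU; the constant c = 0.044715 is the value commonly used in practice.

```python
import math

def gelu_tanh(x, c=0.044715):
    # the tanh approximation counted above (~9 ops per scalar)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + c * x ** 3)))

def gelu_exact(x):
    # exact GELU via the Gaussian CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```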

D.3 Self-attention
Multi-head self-attention with h heads applies self-attention independently to each head, which operates on vectors of size d/h. Among other terms, its cost includes:
• a weighted average of the values for each of the inputs, consisting of 2dN² operations.

E Connection with Lambda Layers and Linear Transformer
We saw in Section 4.7 that HyperMixer implements a form of attention without computing an attention matrix explicitly, and thus scales only linearly with the input length. In that regard, the method is similar to lambda layers (Bello, 2021) and linear Transformers (Katharopoulos et al., 2020). Here, we describe the differences between these approaches and our method. Let us write the standard attention formula and the HyperMixer layer in the following form:

Attention(Q, K, V) = softmax(QK^T)V    (2)
HyperMixer(X) = W1 σ(W2^T X)

where Q, K, V, W1, W2 ∈ R^(N×d′), X ∈ R^(N×d), and W1, W2 are the weights generated by the hypernetwork.
We can notice that the two operations differ mainly in the location of the non-linearity and in how the input is projected. Attention applies a non-linearity to QK^T and uses linear projections of the input (Q, K, V) to construct the attention map. In contrast, HyperMixer uses two mappings of the input (W1, W2) produced by the hypernetwork and applies a non-linearity to W2^T X, which is similar in a way to K^T V. The quadratic cost of the attention layer comes from the location of the non-linearity, as it requires the explicit computation of QK^T ∈ R^(N×N), which is quadratic in the input length. Most strategies for overcoming this quadratic cost find a way of moving this non-linearity. This is the case for Katharopoulos et al. (2020), who apply non-linearities ϕ independently to Q and K, and for Bello (2021), who applies the softmax only to K. In that regard, these two methods are comparable to HyperMixer, as all three scale linearly with the input length due to the location of the non-linearity. Still, HyperMixer is conceptually different because it uses a non-linear transformation of the input and because it has, in our opinion, a simpler and more understandable design entirely based on MLPs.
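The two formulations can be contrasted in a minimal numpy sketch (shapes follow Eq. (2); σ = tanh and the random matrices are placeholders, not trained weights). Attention's non-linearity sits on QK^T, forcing an explicit N × N intermediate, while HyperMixer's sits on W2^T X, whose intermediates are only of size d′ × d.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # non-linearity on QK^T -> explicit (N, N) matrix, quadratic in N
    return softmax(Q @ K.T) @ V

def hypermixer(X, W1, W2, sigma=np.tanh):
    # non-linearity on W2^T X -> only (d', d) intermediates, linear in N
    return W1 @ sigma(W2.T @ X)

rng = np.random.default_rng(0)
N, d, dp = 8, 16, 32                 # sequence length, d, d'
X = rng.standard_normal((N, d))
Q, K, V = (rng.standard_normal((N, dp)) for _ in range(3))
W1, W2 = rng.standard_normal((N, dp)), rng.standard_normal((N, dp))
```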

F Ablations on Transformer Layout
While all Transformer layouts have a feature mixing and a token mixing component in each layer, the arrangement and connection of these components through skip connections and normalization layers remains an open question. The original Transformer paper (Vaswani et al., 2017) uses what is now known as the "post-norm" layout:

x1 = LayerNorm(x + token_mixing(x))
x_out = LayerNorm(x1 + feature_mixing(x1))

where x ∈ R^(N×d) is the input to the layer and x_out ∈ R^(N×d) is its output.
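The post-norm layout can be sketched in a few lines of numpy (without LayerNorm's learnable gain and bias, and with placeholder mixing functions, for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # per-token normalization over the feature axis (no learnable gain/bias)
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def post_norm_layer(x, token_mixing, feature_mixing):
    # "post-norm": LayerNorm applied *after* each residual addition
    x1 = layer_norm(x + token_mixing(x))
    return layer_norm(x1 + feature_mixing(x1))

x = np.random.default_rng(0).standard_normal((4, 8))
# zero mixing functions stand in for the real sublayers here
y = post_norm_layer(x, lambda v: np.zeros_like(v), lambda v: np.zeros_like(v))
```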
As Figure 1 shows, this is the layout we fixed for all previous experiments. In the following, we combine each of the presented layouts with self-attention and HyperMixing, respectively. Since we noticed early on that training with HyperMixing is not stable under some of the layouts, we also experimented with adding two kinds of normalization to HyperMixer: layer normalization applied after TM-MLP, as shown in Algorithm 1, and length normalization. For the latter, we simply scale the generated weight matrices by 1/M, where M is the number of keys. The intuition is that this keeps the magnitude of the activations in the hidden layer of TM-MLP approximately constant across different input lengths.
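The effect of the 1/M scaling can be demonstrated on constant inputs, where it is exact: without scaling, the hidden activation W2^T X sums over M tokens and grows with the input length; after scaling, it is independent of M. The all-ones matrices below are illustrative stand-ins for the hypernetwork output and the inputs.

```python
import numpy as np

d, d_hidden = 16, 32
for M in (8, 64, 512):               # different input lengths
    W2 = np.ones((M, d_hidden))      # stand-in for generated weights
    X = np.ones((M, d))              # stand-in for inputs
    h_raw = W2.T @ X                 # grows linearly with M
    h_norm = (W2 / M).T @ X          # length-normalized: independent of M
    assert np.allclose(h_raw, float(M))
    assert np.allclose(h_norm, 1.0)
```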
The results show that self-attention is relatively insensitive to the type of layout, as all models except ReZero attain an accuracy of 76-77% on average. In contrast, HyperMixer without normalization performs substantially worse with the pre-norm, ReZero, and parallel layouts. Length normalization mitigates this problem to some degree, but adding layer normalization yields the overall best results, with all models achieving between 77% and 78% accuracy on average. We therefore recommend adding layer normalization by default when using HyperMixing in a new context.