Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model

Large and sparse feed-forward layers (S-FFN) such as Mixture-of-Experts (MoE) have proven effective in scaling up Transformer model size for \textit{pretraining} large language models. By only activating part of the FFN parameters conditioned on input, S-FFN improves generalization performance while keeping training and inference costs (in FLOPs) fixed. In this work, we analyze two major design choices of S-FFN: the memory block (a.k.a. expert) size and the memory block selection method, under a general conceptual framework of sparse neural memory. Using this unified framework, we compare several S-FFN architectures for language modeling and provide insights into their relative efficacy and efficiency. We find that a simpler selection method -- \textbf{\texttt{Avg-K}}, which selects blocks through their mean aggregated hidden states -- achieves lower perplexity in language model pretraining compared to existing MoE architectures, including Switch Transformer (Fedus et al., 2021) and HashLayer (Roller et al., 2021).

One promising direction is sparse scaling, which increases the number of parameters while keeping the training and inference cost (in FLOPs) fixed. Recent work focuses on scaling up a transformer's feed-forward network (FFN) with sparsely activated parameters, resulting in a scaled and sparse FFN (S-FFN). There have been two major approaches to achieve S-FFN. One treats S-FFN as a neural memory (Sukhbaatar et al., 2015a) where a sparse memory retrieves and activates only parts of the memory cells (Lample et al., 2019). The other adopts a Mixture-of-Experts network (MoE) (Lepikhin et al., 2021; Fedus et al., 2021; Du et al., 2021; Roller et al., 2021; Lewis et al., 2021; Chi et al., 2022) that replaces a single FFN module with multiple equal-sized ones (called "experts") and only activates a few among the many experts for a particular input.
While both memory and MoE models achieve S-FFN, they have been considered two completely different approaches. We aim to draw connections between these two classes of S-FFN: What critical design choices do they have in common? Which design choices are essential for their modeling capability and computation efficiency? Can the effective ingredients of each method be transferred and combined to improve performance further?
In order to answer these questions, we start from the neural memory view of FFN (Sukhbaatar et al., 2015a) (§2.1) and reduce all S-FFNs to the same mathematical form (§3.1). Then, we characterize various S-FFN methods along two dimensions: memory block size (e.g., expert size) and memory block selection method (e.g., gating) (§3.2).
Using this framework, we make the following contributions: • We study a wide range of memory block sizes beyond the common block size used in MoEs (Fedus et al., 2022) and find that, compared to a larger block size, a smaller one keeps improving perplexity with little extra FLOPs incurred (§5.1, §5.2), leading to better perplexity/computation trade-offs.
• We conduct a systematic exploration of block selection methods to quantify their relative efficacy and efficiency (§5.2). Specifically, we find that selection through a gating function, in general, improves the FLOPs-perplexity trade-off. However, the parameterization of the current MoE gating function yields worse perplexity than using the FFN hidden states.
• Drawing on the insights above, we propose a simple gate for S-FFN - Avg-K (§3.3) - as a hybrid design choice between sparse neural memory and mixture of experts. It efficiently selects memory blocks based on the mean aggregated hidden states of each block. With 1% additional FLOPs, Avg-K achieves 2.16 lower perplexity than a vanilla transformer (16.96), outperforming Switch Transformer (16.45).
Avg-K is the first MoE model that performs well without a load balancing constraint, whereas conventional MoE transformers like Switch Transformer degenerate without it (Fedus et al., 2021; Shazeer et al., 2017; Eigen et al., 2014).

Background

2.1 Feed-Forward Network
A transformer layer (Vaswani et al., 2017) consists of a self-attention block and a Feed-Forward Network (FFN) block. The FFN receives an input vector x ∈ R^d from the self-attention block, multiplies it with K ∈ R^{d_m×d}, applies a non-linear function f to obtain the hidden states m ∈ R^{d_m}, and applies another affine transformation V ∈ R^{d_m×d} to produce a d-dimensional output y. Viewing FFN as a multi-layer perceptron, we can write it as

y = FFN(x) = f(x • K^⊤) • V. (Eq. 1)

Additionally, we can view it as a neural memory (Sukhbaatar et al., 2015b, 2019; Geva et al., 2021):

y = FFN(x) = Σ_{i=1}^{d_m} f(x • k_i) v_i = Σ_{i=1}^{d_m} m_i v_i. (Eq. 2)
In this view, FFN consists of d_m key-value pairs, called memory cells. Each key is a d-dimensional vector k_i ∈ R^d, and together the keys form the key table K ∈ R^{d_m×d}; likewise, the values form a value table V ∈ R^{d_m×d}. The memory multiplies the query input x ∈ R^d with every k_i; after the non-linear function, it produces the memory coefficient m_i = f(x • k_i) for the i-th memory cell. Finally, the output of FFN is the sum of its values v_i weighted by their corresponding memory coefficients m_i. Conventionally, the size of FFN, d_m, is set to 4 • d.
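To make the two views concrete, the following minimal PyTorch sketch (illustrative only; the tensor shapes and GeLU non-linearity are assumptions based on the setup above, not the authors' code) computes the same FFN output once as a two-layer MLP (Eq. 1) and once as a sum over memory cells (Eq. 2).

```python
import torch
import torch.nn.functional as F

# A minimal sketch contrasting the two equivalent views of an FFN:
# (1) a two-layer MLP (Eq. 1) and (2) a key-value neural memory with d_m memory cells (Eq. 2).
d, d_m = 1024, 4 * 1024          # model dimension d and memory size d_m = 4*d
K = torch.randn(d_m, d)          # key table: one d-dimensional key k_i per memory cell
V = torch.randn(d_m, d)          # value table: one d-dimensional value v_i per memory cell
x = torch.randn(d)               # query input from the self-attention block

# MLP view: y = f(x K^T) V
m = F.gelu(x @ K.T)              # memory coefficients m_i = f(x . k_i), shape (d_m,)
y_mlp = m @ V

# Memory view: sum of values v_i weighted by their memory coefficients m_i
y_mem = torch.stack([m[i] * V[i] for i in range(d_m)]).sum(dim=0)
assert torch.allclose(y_mlp, y_mem, atol=1e-3)
```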

Scaling up FFN
As discussed in §1, scaling up the number of parameters in FFN serves as a lever to improve transformer performance. Since a conventional FFN already takes about two-thirds of a transformer layer's parameters (Geva et al., 2021), scaling up FFN greatly increases the parameter count of a transformer model. However, one can sparsely activate these parameters to control the required compute.
In this section, we review two lines of work that achieve a scaled and sparse FFN (S-FFN). One uses a mixture-of-experts model that activates only a few experts (§2.2.1); the other sparsifies a neural memory model (§2.2.2).

Mixture of Experts (MoE)
A mixture of experts (MoE; Jacobs et al. (1991)) consists of a set of expert models {f_i(x)}_{i=0}^{B−1} and a gating function g : R^d → R^B that estimates the importance of each expert. The output is the sum of the experts' outputs weighted by the gate's estimate for each expert: y = Σ_i g_i(x) f_i(x).
Recent work (Du et al., 2021; Lepikhin et al., 2021; Roller et al., 2021; Lewis et al., 2021; Zhou et al., 2022) adapts MoE to the transformer by treating each FFN as one expert and sparsely activating the MoE (SMoE). In SMoE, the gating function (or "router") routes an input token x to a small subset (e.g., 1 or 2) of experts, E = subset(g(x)). Conventionally, SMoE enforces load balancing constraints to avoid overusing a few experts while underutilizing others and converging to local optima (Shazeer et al., 2017; Eigen et al., 2014). SMoEs mainly use two types of gates. Learned gate: each expert is associated with an expert embedding e_i ∈ R^d, and the importance of the i-th expert is obtained by g_i(x) = exp(e_i • x) / Σ_j exp(e_j • x). To enforce load balancing when routing, SMoEs use an additional auxiliary loss (Lepikhin et al., 2021; Fedus et al., 2021; Artetxe et al., 2021; Du et al., 2021; Chi et al., 2022) or frame expert utilization as a constrained optimization problem (Lewis et al., 2021; Zhou et al., 2022).
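As an illustration of the learned gate just described, the sketch below (PyTorch, with hypothetical sizes; the names E, B, and b are ours, not from a specific library) scores all experts with expert embeddings, applies a softmax, and keeps the top-b experts.

```python
import torch

# A minimal sketch of a learned SMoE gate (shapes are assumptions, not a specific library's API).
# Each expert i owns an embedding e_i; its importance is g_i(x) = softmax_i(e_i . x), and the
# token is routed to the top-b experts in subset(g(x)).
d, B, b = 1024, 16, 2                   # model dim, number of experts, experts activated per token
E = torch.randn(B, d)                   # expert embeddings e_0, ..., e_{B-1}
x = torch.randn(d)

gate = torch.softmax(E @ x, dim=0)      # g_i(x) for every expert, shape (B,)
weights, experts = gate.topk(b)         # gate weights and indices of the b selected experts
# The SMoE output is sum_{i in experts} g_i(x) * FFN_i(x), where each FFN_i is a 4*d-wide FFN;
# an auxiliary load-balancing loss (not shown) keeps expert usage roughly uniform.
```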
Static gate: in contrast to a learnable gate, a static gate does not have any differentiable parameters. Instead, it uses a static mapping that encodes load-balancing constraints to route inputs (Roller et al., 2021; Gururangan et al., 2021). For example, RandHash from HashLayer (Roller et al., 2021) uses a hash table that maps each token type to randomly selected expert(s). DEMix (Gururangan et al., 2021) ensures that the i-th expert only sees data from the i-th domain.

Sparse Neural Memory
The other line of work follows the memory view of FFN (Eq. 2). It is straightforward to increase the memory size d_m to a much larger value d_m ≫ 4 • d. By only using the top-k entries in m = x • K^⊤, one can sparsely activate the value table, resulting in a vanilla sparse memory (VanillaM). However, this straightforward approach still requires dense computation that grows linearly with the memory size. Lample et al. (2019) explored the following two techniques to scale this computation sublinearly.
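Before turning to the factorized variants below, the following sketch (assumed shapes and GeLU non-linearity; not the paper's implementation) shows the VanillaM computation: scoring against the full key table remains dense, and only the value-table read is sparse.

```python
import torch
import torch.nn.functional as F

# A minimal sketch of VanillaM (assumed shapes): a memory with d_m >> 4*d cells where only the
# top-k memory coefficients are kept, so only k rows of the value table are read. Note that
# scoring against the key table is still dense, which is what LoRKM/PKM try to avoid.
d, d_m, k = 1024, 16 * 4 * 1024, 4096
K = torch.randn(d_m, d)
V = torch.randn(d_m, d)
x = torch.randn(d)

m = F.gelu(x @ K.T)             # memory coefficients for every cell (dense in the key table)
vals, idx = m.topk(k)           # keep only the top-k entries of m
y = vals @ V[idx]               # sparse read: weighted sum of the k selected values
```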
Low-Rank Key Memory (LoRKM) A straightforward technique is to assume that the key table is composed of, and approximated by, a down-projection D ∈ R^{d×d_ℓ} and a low-rank key table K′ ∈ R^{d_m×d_ℓ} with d_ℓ ≪ d, so that the memory coefficients become m_i = f((x • D) • k′_i).

Product Key Memory (PKM) Building upon LoRKM, PKM further decomposes the low-rank key table by assuming that different low-rank keys have structured sharing with each other. We defer the technical details to Lample et al. (2019). Due to such factorization, the key table K′ • D^⊤ of PKM has negligibly few parameters (e.g., < 0.3%) relative to the value table.

A Unified View of Sparse FFNs
We show the connections between MoE and neural memory despite their different surface forms.We first derive an equivalent form of MoE to establish its connection with sparse memory ( §3.1).Then, we propose a unified framework for S-FFN ( §3.2).

A Closer look at MoE
MoEs use a gating function to estimate the importance of all experts and combine the experts' outputs through a linear combination. Here, inspired by the memory view of FFNs (§2), we can view an MoE as one big memory that is chunked into B FFNs, where FFN^(i) denotes the i-th FFN:

y = Σ_{i=0}^{B−1} g_i(x) • FFN^(i)(x) = Σ_{l=0}^{B•d_m−1} m_l v_l. (Eq. 4)

In Eq. 4, v_l is the l-th row of the stacked d_m-sized value tables of the B FFN experts, and m_l = m_{i•d_m+j} is its memory coefficient, obtained by weighting the j-th memory coefficient of the i-th expert with the gate value g_i(x). For example, m_{d_m} is obtained by weighting m^(1)_0 - the 0-th memory coefficient of the 1-st FFN expert - with g_1(x) - the gate's estimate for the 1-st FFN expert. Building on this memory view, one can see that SMoE is a sparse memory operating in terms of blocks: it uses its gate to narrow the summation over the stacked value tables down to the selected FFN^(i), for i ∈ subset(g(x)).
Comparison with Sparse Memory Both SMoE and sparse memory are neural memories, but there are several differences. 1) Whether memory cells share the same importance weight: in sparse memory, each memory cell receives an individual weight, whereas in SMoE, each group of 4 • d memory cells shares the same importance weight g_i(x). 2) Memory selection criterion: sparse memory uses the dot product between the input token and each key vector, x • k_i, whereas SMoE depends on a separately parameterized gate g.

The Unified Framework
We propose a general framework that unifies the two different approaches to S-FFN. We cast both as instances of a memory with a large key and value table - K ∈ R^{d_m×d_k}, V ∈ R^{d_m×d_v}, where d_m ≫ 4 • d. We distinguish the different methods along two dimensions, illustrated below and summarized in Table 1. Memory block size specifies how many memory cells share the same importance weight at selection time and are thus treated together as a memory block. We use g to denote the size of one block. In other words, we split K and V along the d_m-dimension into g-sized blocks, so a memory consists of B = d_m/g blocks in total. Formally, we write K_g = [K^(0); ...; K^(B−1)] and V_g = [V^(0); ...; V^(B−1)]. For example, sparse memory has block size g = 1, trivially treating each memory cell as a "block", and SMoE has block size g = 4 • d (§3.1). Current approaches generally use fixed block sizes, but this is mostly an artifact of how the methods were derived rather than a mathematical constraint. For example, instead of 1 expert of size 4 • d, we can design an SMoE variant that uses 2 experts of size 2 • d. We can similarly chunk the memory coefficients m into blocks of size g in sparse memories.
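As a minimal illustration of this framework (sizes are hypothetical), the sketch below chunks a large key/value table into B blocks of size g; setting g = 1 recovers sparse memory, and g = 4 • d recovers an SMoE-style expert.

```python
import torch

# A minimal sketch of the unified view (assumed sizes): split the key/value tables along the d_m
# dimension into B = d_m / g blocks of g memory cells. g = 1 recovers sparse memory (each cell
# is its own block); g = 4*d recovers an SMoE whose experts are the blocks.
d, d_m, g = 1024, 16 * 4 * 1024, 2 * 1024
B = d_m // g
K = torch.randn(d_m, d)
V = torch.randn(d_m, d)

K_g = K.view(B, g, d)           # K_g = [K^(0); ...; K^(B-1)]
V_g = V.view(B, g, d)           # V_g = [V^(0); ...; V^(B-1)]
```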
Memory block selection method is the specific function that computes the importance of each memory block for selection. Since SMoE is also a type of sparse memory, we distinguish selection methods by a new criterion: whether the input x is allowed to directly interact with the key table K_g. As discussed in §3.1, SMoE selects using the estimate from a separately parameterized gate, while sparse memory solely and directly uses the key table. Thus, current SMoEs are indirect selection methods, and sparse memories are direct ones. Various SMoEs are further characterized by whether their gating function has learned parameters or consists of a static mapping (§2.2.1). Meanwhile, sparse memories are characterized by how much factorization the key table uses (§2.2.2).

A New Selection Method -Avg-K
As shown in Fig. 3, a gated MoE has a clear advantage over sparse memory. Based on the contrastive analysis in §5.2, we believe the current MoE already has good design choices - a reasonably large memory block size and a full-parameter key table. However, an important improvement is to make more use of each expert's key table for routing tokens. Additionally, the method needs to be compute-efficient, with a reasonably small memory block size g (§5.1).
To this end, we propose a new routing method -Avg-K-as a hybrid design choice between sparse neural memory methods and mixture-of-experts methods.We don't enforce load balancing in Avg-K experiments ( §5.3).
With Avg-K, we represent each block with the average of its key table: e_i = Avg(K^(i), dim=0) = (1/g) Σ_j k^(i)_j. Then, we use the dot product between x and these averages to select the top-b blocks and route the token there for the memory computation (Eq. 2). Due to the linearity of averaging, the operation e_i • x is equivalent to calculating the average of dot products within a block, without GeLU. Since all tokens share the averages, our method is efficient. We provide more rationale for our choice of the average function in Appendix C.1.

Table 1: S-FFN methods decomposed along the defined design dimensions. Learned gate: Switch Transformer (Fedus et al., 2021), GShard (Lepikhin et al., 2021), GLaM (Du et al., 2021), BASELayer (Lewis et al., 2021), X-MoE (Chi et al., 2022). Static gate: HashLayer (Roller et al., 2021), DEMix (Gururangan et al., 2021).
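A minimal sketch of Avg-K routing follows (PyTorch, assumed shapes; not the released implementation): block representations are the means of the block key tables, the token is routed to the top-b blocks by dot product with those means, and the memory computation runs only inside the selected blocks.

```python
import torch
import torch.nn.functional as F

# A minimal sketch of Avg-K routing (assumed shapes, not the released implementation). Each
# block is summarized by the mean of its keys; the token is routed to the top-b blocks by dot
# product with these means, and the memory computation (Eq. 2) runs only inside those blocks.
d, d_m, g, b = 1024, 16 * 4 * 1024, 2 * 1024, 2
B = d_m // g
K_g = torch.randn(B, g, d)                  # key table chunked into B blocks
V_g = torch.randn(B, g, d)                  # value table chunked the same way
x = torch.randn(d)

block_means = K_g.mean(dim=1)               # averaged key per block, shape (B, d); shared by all tokens
scores = block_means @ x                    # by linearity, = mean over the block of (x . k_i)
top = scores.topk(b).indices                # route the token to the top-b blocks
m = F.gelu(torch.einsum('bgd,d->bg', K_g[top], x))     # memory coefficients inside selected blocks
y = torch.einsum('bg,bgd->d', m, V_g[top])             # output computed from the selected blocks only
```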
4 Experiment Setup

Models
Dense Baseline We use the transformer architecture used in GPT-3 models (Brown et al., 2020): 24 transformer layers, d = 1024, GeLU activation functions, and a memory size (FFN hidden size) of 4 • d.
S-FFN Given the model above, we replace some of its FFNs with S-FFNs. Similar to Lepikhin et al. (2021), we replace the FFN in every 6th layer, leading to 4 S-FFNs in total across the 24 layers. Since the memory block size g is a grouping imposed on memory cells, we use k to denote the number of active memory cells, which controls how much of the S-FFN is activated. We use the formulation d_m = E • (4 • d) to control the size of the S-FFN, so the S-FFN activates b = k/g out of B = d_m/g memory blocks. We count FLOPs analytically following Narayanan et al. (2021) and do not account for whether one worker finishes computation before another (when using model parallelism). We use the number of learnable parameters to judge whether two models are equally expressive. In Table 2, we list all S-FFN models used for analysis in §5.
PKM-FFN Since the factorized key table in PKM has few (< 0.3%) learnable parameters relative to the value table, we propose an indirect variant called PKM-FFN to match the parameter count of other models like RandHash. This variant has memory block size g = 1 and the same key-value table as RandHash. PKM-FFN has a gate whose g(x) is the same as the m produced by a PKM, with g_i = m_i; no load balancing is enforced.

Language Modeling
Pretraining Data We pretrain all S-FFN models on a total of 453GB of text with 112B tokens from a union of six English-only datasets, including the English subset of CC100 and the five datasets used to pretrain RoBERTa (Liu et al., 2019) - specifically BookCorpus, English Wikipedia, CC-News, OpenWebText, and CC-Stories (details in Appendix A.3). We adopt the same Byte-Pair Encoding as GPT-2 (Radford et al., 2019) and RoBERTa (Liu et al., 2019) with a vocabulary of 50K subword units. All models are trained on 60B tokens for convergence.
Evaluation settings We evaluate our models' ability to predict the next token in a sequence, as measured by perplexity. We report both in-domain and out-of-domain perplexity to indicate generalization ability. For out-of-domain, we use data from The Pile (Gao et al., 2020), a public dataset that combines data from 22 diverse sources.

Analysis Results
In this section, we use the proposed unified view to systematically study the design choices of S-FFN. Specifically, (1) we study a wide range of block sizes beyond the incidental choices used in existing work and investigate their impact on language modeling perplexity (§5.1); (2) since both direct and indirect block selection methods lead to lower perplexity than a standard FFN, we study which type of method has the better FLOPs-perplexity trade-off and examine the relative efficacy and efficiency of the different methods (§5.2).

Memory block size
Since block size is a natural number, we aim to answer a straightforward question: given a fixed number of active memory cells k, does a smaller memory block size lead to lower perplexity? We use simple and robust selection methods to disentangle the impact of hyperparameter choices. Specifically, we use random hashing as recommended in HashLayer (Roller et al., 2021) (denoted RandHash) for indirect block selection, and exact top-k memory block selection (denoted VanillaM) for direct block selection. For all experiments, we use E = 16. RandHash randomly selects b = k/g unique memory blocks among all B = d_m/g blocks - essentially sampling b unique values from Uniform([0, ..., B−1]). Originally, with block size g = 4096, RandHash assigns a token to 4096/4096 = 1 block; with block size g = 2048, to 4096/2048 = 2 blocks. VanillaM originally has block size g = 1 and selects the top-k scalars in the memory coefficients m = GeLU(x • K^⊤). We make a minimal change to extend it to a larger block size g: given m, we chunk it into B blocks - m_g = [m^(0); ...; m^(B−1)]; then, we select the top-b blocks using the average of each block - Avg(GeLU(x • (K^(i))^⊤), dim=0); a minimal sketch of this extension is given below. In Fig. 2, we observe that a smaller block size leads to an improvement of 0.4 (15.75 → 15.35) perplexity for RandHash and an improvement of 0.87 (15.56 → 14.69) for VanillaM. In Appendix B.2, we provide theoretical justification for this observation, showing that a smaller block size improves model capacity by admitting more combinations of memory cells. For example, with block size g/2, half of the memory cells of expert 1 could be activated together with half of those of expert 2; this combination is impossible with the larger block size g.
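The sketch below (assumed shapes; illustrative, not the training code) shows this block-size-g extension of VanillaM: memory coefficients are chunked into blocks, blocks are ranked by the mean of their post-GeLU coefficients, and the top-b blocks are kept.

```python
import torch
import torch.nn.functional as F

# A minimal sketch (assumed shapes) of the block-size-g extension of VanillaM used here: chunk
# the post-GeLU memory coefficients into B blocks, score each block by the mean of its
# coefficients, and keep the top-b blocks so that k = b * g memory cells stay active.
d, d_m, g, k = 1024, 16 * 4 * 1024, 256, 4096
B, b = d_m // g, 4096 // 256                # number of blocks and number of selected blocks
K = torch.randn(d_m, d)
x = torch.randn(d)

m = F.gelu(x @ K.T).view(B, g)              # chunked memory coefficients m_g = [m^(0); ...; m^(B-1)]
block_scores = m.mean(dim=1)                # Avg(GeLU(x . (K^(i))^T), dim=0) for each block i
selected = block_scores.topk(b).indices     # top-b blocks; the rest of the memory stays inactive
```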

Memory block selection method
Next, we investigate the impact of the selection method; specifically, we examine the FLOPs-perplexity trade-off of direct and indirect methods to determine the overall usefulness of each S-FFN method.
FLOPs-perplexity trade-off We study the efficiency of direct and indirect selection methods in S-FFN models, as characterized by the FLOPs-perplexity trade-off. We conduct experiments across different scales of the memory by varying E ∈ {4, 16}; additionally, we run E = 32 for PKM.
In Fig. 3, we marginalize over the different factors used in the two selection methods - i.e., types of gates, factorization techniques for the key table, etc. - and consider each type of selection method as a whole. Across these marginalized factors, we observe that indirect methods tend to improve more as we use more FLOPs (with larger memory sizes controlled by E). Thus, the indirect method has a better FLOPs-perplexity trade-off.

Effect of gating function
We start with contrastive comparisons among PKM-FFN_{E=16}, PKM_{E=32}, and RandHash_{E=16}, all with memory block size g = 1 and 4096 active memory blocks. From these three parameter-matched models, we can learn important lessons for improving the design of the gate: 1. Compared with PKM-FFN_{E=16}, PKM_{E=32} essentially moves the parameters from a full-parameter key table into doubling the size of the value table.
2. PKM-FFN_{E=16} and RandHash_{E=16} have the same (in size) key and value tables, but the former uses a gate jointly learned with the key table, while the latter uses a learning-free gate.
As shown in Table 3, on out-of-domain data, PKM-FFN_{E=16} outperforms PKM_{E=32} (16.06) by 0.87 perplexity and slightly outperforms RandHash_{E=16} by 0.16. Therefore, it is essential to have a full-parameter, and thus sufficiently expressive, key table to produce the memory coefficients.
Table 3 shows that the improvements of VanillaM_{E=16}, PKM-FFN_{E=16}, and RandHash_{E=16} over the Dense Baseline (16.96) are 2.27, 1.77, and 1.61, respectively, on out-of-domain data. They differ only in how much they depend on the key table for selection: VanillaM uses it directly, PKM-FFN indirectly gains information from it, and RandHash completely ignores it. Given the consistent observation across out-of-domain test sets, we conclude that the more the selection method depends on the key table, the better the resulting language model; indirect usage (PKM-FFN) is not enough.

Results on Avg-K
Language Modeling Pretraining As shown in Fig. 6, when the block size g decreases (e.g., to 256 and below), the improvement of VanillaM is less significant than that of Avg-K. The comparison suggests that, with a larger block size, the GeLU activation protects the average operation in VanillaM (taken after GeLU) from being affected by (potentially many) large negative values, because lim_{x→−∞} GeLU(x) = 0. In contrast, with a smaller block size, this "negative value" problem is mitigated. Since negative dot products affect Avg-K more, it prefers blocks with more, or very positive, dot products, whereas VanillaM, being protected from negatives, might fail to detect those blocks. Therefore, Avg-K can achieve even slightly better perplexity than VanillaM for block size g ≤ 256. See the full discussion in Appendix C.2.
In Figure 7, we also include a load balancing analysis of Avg-K. To our surprise, the mode collapse (i.e., imbalanced usage of memory blocks) issue in Avg-K is severe, as no load balancing loss is enforced. Given its superior performance, this suggests that Avg-K learned a good representation for each memory block despite the disadvantage of load imbalance.

Approximate Nearest Neighbour (ANN) search One might wonder whether ANN techniques could help search for the best keys in VanillaM, rather than trading the expressiveness of the key table for efficiency. For example, one could index the unfactorized key table with ANN methods like FAISS (Johnson et al., 2021) and ScaNN (Guo et al., 2020). One successful example is applying vanilla Locality-Sensitive Hashing to Reformer (Kitaev et al., 2020). However, in our preliminary study, we found that perplexity is greatly affected by the search quality, and building a data structure after every update is expensive and hard to avoid. We leave the detailed discussion to Appendix D.2.

Conclusion
We provide a unified framework for designing sparse FFNs in transformers and analyze existing S-FFN methods, such as SMoEs, on the language modeling task. Using this framework, we find that a smaller memory block (e.g., expert) size improves perplexity at the cost of a slightly higher computation cost. Selection methods with gates have better FLOPs-perplexity trade-offs than those without, while the gating function in current SMoEs is suboptimal. This framework enables us to instantiate a simpler S-FFN architecture that outperforms SMoEs while remaining efficient in training and inference.

Limitations
Limitations of a smaller block size g With model parallelism (Lepikhin et al., 2021), as discussed in Appendix B.3.2, a smaller memory block size induces higher communication cost under the current all_to_all-based implementation frameworks (e.g., Switch Transformer). We think reducing the memory block size to 1 is too extreme to be practical; there should be a sweet spot between 1 and 4096 (or the chosen expert size) allowed by the implementation and hardware.
Limitations of the unified framework Since our method Avg-K essentially applies average pooling to the key table K_g, a better alternative may exist. Our method also heavily depends on dot-product information, but this might not be the best signal to use. Due to the curse of dimensionality, future work might focus on finding better metrics than the dot product, and aggregation methods other than the average, to measure distances between high-dimensional vectors.
Also, we did not train Avg-K with load balancing due to our current budget limit, but we include our rationale for why Avg-K should work with load balancing in Appendix C.3.
Additionally, in large-scale SMoE training, speed is limited by the most heavily loaded GPU when model parallelism is used; therefore, load balancing is essential. We also note that our scale is relatively small and does not use model parallelism, so this problem is not pronounced for us. Future follow-up work should look at how to incorporate load balancing into the unified framework and inspire better actionable design choices. We think such unification requires a more advanced theoretical connection between memory block size and block selection method, which likely involves consideration of the training procedure.

Ethics Statements
Due to the nature of pretraining, the carbon footprint of our work is large, as estimated by the FLOPs and GPUs reported in the paper. We made an effort to minimize the cost at the design stage of the project. In our preliminary study, we asked one of the authors of Artetxe et al. (2021) for recommendations to choose and verify the minimal model size and number of tokens that sufficiently differentiate the design choices.
Another ethical concern stems from the pretraining data we use. As we used the same data sources as Artetxe et al. (2021), we refer the reader to the ethics statement in Artetxe et al. (2021) for a discussion of how much the trained model absorbs bias and toxicity from the training data.

In Fig. 4a and 4b, we evaluate our model with E = 16 on our validation subset and calculate the estimates across various g. We observe that less sharing happens as the block size decreases. However, the empirical estimates for RandHash are relatively constant across granularities. We suspect this is due to the Zipfian distribution of tokens. Also, we note that the magnitudes of E[r] differ across methods; we defer the explanation of this phenomenon to future work.

RandHash is efficient for computation because a hash table theoretically has time complexity O(1).
In contrast, a conventional learned gate (§2.2.1) has a d-dimensional embedding for each memory block. Therefore, with B memory blocks in total, it has time complexity O(d • B). In Table 8, we show how the FLOPs percentage of the learned gate in a single forward-backward pass changes with the memory block size, assuming the setup in §4.
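For intuition, the short calculation below (assumed sizes matching §4; the 2·d·B multiply-add estimate is an approximation) tabulates the per-token scoring cost of a learned gate, which grows as O(d • B), whereas a RandHash lookup stays O(1) regardless of the number of blocks.

```python
# A minimal sketch (assumed constants) comparing selection cost per token: a learned gate scores
# all B blocks with d-dimensional embeddings (about 2*d*B multiply-adds), while RandHash is a
# single hash-table lookup (O(1)).
d = 1024
for g in (4096, 1024, 256, 64, 1):
    d_m = 16 * 4 * d                 # total memory cells with E = 16
    B = d_m // g                     # number of memory blocks
    gate_flops = 2 * d * B           # one size-d dot product per block (multiply + add)
    print(f"g={g:5d}  B={B:6d}  learned-gate FLOPs per token ~ {gate_flops:,}")
```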

B.3.2 Cost of communication
The conventional training framework for MoE (Fedus et al., 2021) depends on all_to_all operations (Paszke et al., 2019) to route tokens to different devices. One might expect the communication cost to remain the same if the number of devices does not change. However, this assumes that tokens are identified only by their type. In fact, the training framework further identifies routed tokens by the expert they are routed to. Therefore, the communication cost scales linearly with the number of memory blocks.

C Avg-K
C.1 Rationale for using Avg in Avg-K

We base our choice heavily on experiments with aggregators in VanillaM (Table 7). From the experiments with the average absolute value (after GeLU), we hypothesize that a positive feature is good at predicting the value of a label/token against all others, whereas a negative value is good at negating the prediction of a single token. As such, positive features are more predictive than negative ones. Although the situation might be different for Avg-K (which aggregates before GeLU), we expect the selection to only be affected more because of the larger impact of negative values. We also consider the experiment with max-pooled hidden states (i.e., Max(•)). This experiment shows that a memory block hardly ever has a single key-value cell that dominates the others, since Max(•) underperforms both Avg(•) and Avg(|•|). To make matters worse, the max operation overlooks most hidden states at selection time, even though the overlooked hidden states still contribute to the computation. In contrast, performance increases when we use the average (or the average of absolute values), where every hidden state contributes to the decision. Although the situation is slightly different in Avg-K, a "max-pooled" version of Avg-K would overestimate the hidden-state information even more, and the aggregated value would not be indicative of the hidden states used for computation.
The last consideration is that the average function is linear. When we select experts, we use the dot product between the input and the averaged keys. Due to linearity, this value is equivalent to taking the dot product between the input and every key and then averaging (see Appendix C.2). Thus, this design choice saves a great amount of computation compared with VanillaM, while keeping the neural memory analogy.

C.2 Avg-K analysis through comparison with VanillaM
Avg-K essentially applies average pooling to the unfactorized K_g to create a representation of each block. Due to the linearity of averaging, the operation e_i • x is equivalent to calculating the average of the dot products within a block before GeLU and selecting blocks with this average:

e_i • x = ((1/g) Σ_j k^(i)_j) • x = (1/g) Σ_j (x • k^(i)_j).

In contrast, VanillaM uses the average after GeLU (§5.1):

(1/g) Σ_j GeLU(x • k^(i)_j).

Because GeLU is non-linear, the averaged keys used by Avg-K are shared across tokens, whereas VanillaM's per-block averages must be recomputed for every token; this makes Avg-K efficient.
In Fig. 6, we experiment with both methods across various g. We observe that when g decreases from 4096, the perplexity of Avg-K drops more drastically than that of VanillaM. We believe this observation highlights the impact of GeLU. Because lim_{x→−∞} GeLU(x) = 0, with a large block size Avg-K includes more, and potentially very negative, values in its average, and thus makes worse choices than VanillaM. On the other hand, when g decreases, this "negative value" problem is mitigated. When there are more blocks available for selection (smaller g), because negative dot products affect Avg-K more, it prefers blocks with more, or very positive, dot products; whereas VanillaM is protected from negative values, so it may fail to detect those blocks. Therefore, Avg-K with g ≤ 256 can achieve even better perplexity.
C.3 Why Avg-K should work with load balancing
Comparing VanillaM and Avg-K, one would expect Avg-K to be greatly affected by extremely negative hidden states (before GeLU). Yet the final model with Avg-K can even outperform VanillaM at the same block size (Fig. 6). This suggests the model can accommodate small design changes. Additionally, the need for a load balancing loss is determined by the sparsity of gradients: if one "expert" gets updated with gradients while other "experts" are starved, that expert will be selected all of the time, leading to a form of mode collapse. In fact, with the same memory block size (g = 4096), we were surprised to observe that Avg-K (without load balancing loss; row 6 in Table 4) still performs on par with Switch (with load balancing; row 4 in Table 4). As our load balancing analysis in Appendix C.4 suggests, when the number of experts is small, the mode collapse issue in Avg-K is severe. This makes us more confident that Avg-K will perform better with a standard load balancing loss added.
Therefore, we believe that with the load balancing loss added, the model will accommodate it and could still perform competitively.

C.4 Avg-K load balancing analysis
On the same validation set as used in §B, we also conduct a load balancing analysis of memory blocks. Fig. 7 shows that Avg-K and VanillaM use some memory blocks disproportionately.

D Preliminary study for related work

D.1 Terraformer analysis
Controller in Terraformer Jaszczur et al. (2021) use a controller to score all memory cells and preselect a subset - Controller(x) - for computation.
This is closest to our PKM-FFN, since their controller is essentially a gate with a low-rank key table as in LoRKM: g(x) = (x • D) • (K′)^⊤, where D ∈ R^{d×d_ℓ}, K′ ∈ R^{d_m×d_ℓ}, and d_ℓ ≪ d. The difference is that they additionally assume the estimates from the gate (and the memory) are chunked into blocks and only select the top-1 memory cell scored by the controller from each block, i.e., in block i they keep the cell j* = arg max_j g(x)^(i)_j. Therefore, their number of active memory cells k is equal to d_m/g. Similar to our contrastive pair of PKM-FFN and VanillaM, we hypothesize a "vanilla" version of their method: the memory is chunked into blocks of size g - K_g = [K^(0); ...; K^(B−1)], and similarly for V_g - and one then chooses the top-1 cell of each block directly with the key table, j* = arg max_j x • k^(i)_j. We call it VanillaController.
In Fig. 8, we compare VanillaController to VanillaM with g = 1, because the actual selection happens at the level of g = 1.
We set k in VanillaM to the value determined above (k = d_m/g). We observe that VanillaM outperforms VanillaController. Although the controller design as a gating function is justified (§5.2), the design choice of "chunking the memory but only selecting the best memory cell per block" seems unmotivated. Thus, we exclude this design from our analysis. A minimal sketch of the hypothesized VanillaController is given below.
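For completeness, the sketch below (assumed shapes and GeLU non-linearity; a hypothesized illustration, not Terraformer's code) implements the VanillaController variant: the memory is chunked into blocks and only the highest-scoring cell in each block is kept.

```python
import torch
import torch.nn.functional as F

# A minimal sketch (assumed shapes) of the hypothesized VanillaController: chunk the memory into
# B blocks of size g, and in each block keep only the single cell with the highest raw dot
# product x . k, so k = d_m / g cells are active in total.
d, d_m, g = 1024, 16 * 4 * 1024, 256
B = d_m // g
K_g = torch.randn(B, g, d)
V_g = torch.randn(B, g, d)
x = torch.randn(d)

scores = torch.einsum('bgd,d->bg', K_g, x)          # x . k for every cell, grouped by block
j_star = scores.argmax(dim=1)                       # top-1 cell per block: j* = argmax_j x . k_j^(i)
rows = torch.arange(B)
m = F.gelu(scores[rows, j_star])                    # memory coefficients of the selected cells
y = (m.unsqueeze(1) * V_g[rows, j_star]).sum(dim=0) # weighted sum of the selected values
```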

D.2 ANN
Since ANN is an approximation to exact search, we propose to randomly sabotage VanillaM, which uses exact search. Given a k, we randomly swap n% of the top-k memory coefficients m (the exact search results) with non-top-k values (during both training and validation), yielding a search accuracy of (100 − n)%. We call this Naive-ANN. It is meant as a random baseline for ANN, because different ANN techniques might make systematic rather than random mistakes. Nevertheless, we believe it can still serve as a proxy and shed light on how search quality affects performance. As we see in Fig. 9, model quality is sensitive to the quality of the ANN.
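A minimal sketch of Naive-ANN follows (the helper naive_ann_topk is hypothetical, written for illustration): it takes the exact top-k indices and randomly swaps n% of them with non-top-k cells to emulate an index with (100 − n)% recall.

```python
import torch

# A minimal sketch of Naive-ANN (a hypothetical helper, not the paper's code): given the exact
# top-k indices of the memory coefficients, randomly swap n% of them with non-top-k cells to
# emulate an ANN index whose recall is (100 - n)%.
def naive_ann_topk(m: torch.Tensor, k: int, n_percent: float) -> torch.Tensor:
    _, idx = m.topk(k)                                   # exact-search result
    num_swap = int(k * n_percent / 100)
    swap_pos = torch.randperm(k)[:num_swap]              # which of the top-k slots to corrupt
    mask = torch.ones_like(m, dtype=torch.bool)
    mask[idx] = False                                    # marks the non-top-k cells
    non_topk = mask.nonzero(as_tuple=True)[0]
    repl = non_topk[torch.randperm(non_topk.numel())[:num_swap]]
    corrupted = idx.clone()
    corrupted[swap_pos] = repl                           # replace chosen slots with random non-top-k cells
    return corrupted

# Example: k = 4096 active cells out of d_m = 65536, with 10% of them swapped at random.
m = torch.randn(65536)
idx = naive_ann_topk(m, k=4096, n_percent=10.0)
```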
In our preliminary study, we found that building the data structure after every update is expensive. This is a critical drawback when applying such techniques to model parameters. Although one could amortize the cost by rebuilding periodically, the outdated data structure leads to lower accuracy. If one chooses hyperparameters that yield higher search quality, the cost of preprocessing and of the corresponding search becomes even higher. To make matters worse, current ANN methods' search either does not support GPU speedup or is not well-integrated with GPUs, making it slower than computing the exact dot products with a CUDA kernel.

Figure 3 :
Figure 3: FLOPs-perplexity trade-off of indirect block selection is better than that of direct block selection. Indirect methods (orange crosses) show more perplexity improvement relative to increases in FLOPs than direct methods (blue dots). See a more detailed legend (e.g., including methods like LoRKM) in Fig. 5.

Figure 4 :
Figure 4: Expected number of shared memory cells across various block sizes g.

Figure 5 :
Figure 5: FLOPs-perplexity trade-off of different models, where direct/indirect methods are further distinguished by model name.

Figure 6 :
Figure 6: Perplexity (lower the better) of Avg-K and VanillaM across various g. We observe a large drop in perplexity as g decreases for Avg-K and less so for VanillaM; Avg-K slightly outperforms VanillaM with g ≤ 256.

Figure 1 :
Figure 1: Sparse Mixture-of-Expert (Option 1) and Sparse Neural Memory (Option 2) as two different methods of scaling the FFN (key table and value table) in a transformer layer.

Table 2 :
All the S-FFN models used in the experiments and analysis in §5. g is the number of memory cells grouped in a memory block, k is the number of active memory cells, and E controls the size of the memory, d_m = E • (4 • d). Some settings (*) are only used for PKM.
Figure 2 :
Figure 2: Perplexity (lower the better) consistently improves as memory block size g decreases for both the direct (VanillaM) and indirect (RandHash) selection methods in S-FFN models. Ranking on each individual out-of-domain test set generally follows the ranking by average perplexity (e.g., 20 out of 22).

Table 3 :
List of experiments for contrastively comparing designs. This table assumes each memory cell is a memory block, i.e., g = 1. The top two best-performing models (bolded) have a full-parameter key table and depend more on the dot product to activate parameters. Ranking on each individual out-of-domain test set generally follows the ranking by average perplexity (e.g., 21 out of 22).

Table 4 :
Avg-K compared against other selection methods. Switch Transformer is trained with the load balancing loss to prevent model degradation (Shazeer et al., 2017; Eigen et al., 2014). Ranking on each individual out-of-domain test set generally follows the ranking by average perplexity (e.g., 21 out of 22).

Table 8 :
FLOPs percentage of learned gate increases when memory block size g decreases

Table 9 :
Detailed out-of-domain perplexity for Table 3. The best two performances on each domain are in bold. The relative ranking on each domain generally follows the relative ranking by averaged performance (i.e., the last row).