Efficient Transformers with Dynamic Token Pooling

Transformers achieve unrivalled performance in modelling language, but remain inefficient in terms of memory and time complexity. A possible remedy is to reduce the sequence length in the intermediate layers by pooling fixed-length segments of tokens. Nevertheless, natural units of meaning, such as words or phrases, display varying sizes. To address this mismatch, we equip language models with a dynamic-pooling mechanism, which predicts segment boundaries in an autoregressive fashion. We compare several methods to infer boundaries, including end-to-end learning through stochastic re-parameterisation, supervised learning (based on segmentations from subword tokenizers or spikes in conditional entropy), as well as linguistically motivated boundaries. We perform character-level evaluation on texts from multiple datasets and morphologically diverse languages. The results demonstrate that dynamic pooling, which jointly segments and models language, is both faster and more accurate than vanilla Transformers and fixed-length pooling within the same computational budget.


Introduction
The Transformer architecture (Vaswani et al., 2017) lies at the heart of cutting-edge generative models, such as GPT-3 (Brown et al., 2020) for text and DALL·E 2 (Ramesh et al., 2022) for images. Its success can be largely attributed to the ability to leverage a considerable amount of data, which yields performance gains (Kaplan et al., 2020) and emergent abilities (Wei et al., 2022) in accordance with well-established scaling laws. Nonetheless, the time and memory efficiency of Transformers remains constrained by their algorithmic complexity of O(l² · n), where l stands for the sequence length and n for the number of layers.
To remedy this shortcoming without renouncing the expressivity of a deep model, the quadratic self-attention can be sparsified (Child et al., 2019; Roy et al., 2021; Ren et al., 2021) or linearly approximated (Beltagy et al., 2020). Hourglass Transformers (Nawrot et al., 2022) provide instead an alternative solution, where the sequence length is reduced in the intermediate layers by merging fixed-size groups of tokens. These pooled representations are up-sampled back to the original length in order to generate sequences in an auto-regressive fashion (Ronneberger et al., 2015).
Nevertheless, pooling groups of fixed size is suboptimal in several respects. First, these groups are misaligned with linguistic primitives: units of meaning such as morphemes, words, phrases, and clauses vary in size. Second, the elements of a sequence may carry different degrees of information (for instance, silence and voice in speech). Ideally, the model should perform computation that is both hierarchical, relying on the same abstractions as human language processing, and conditional, allocating resources to sub-sequences in proportion to the model's uncertainty. In this work, we demonstrate that dynamic pooling results not only in higher shortening rates of input sequences, and thus increased efficiency, but also in superior next-token prediction, owing to the correct inductive bias for grouping tokens.
To this end, we propose a new Transformer variant that jointly models token sequences and dynamically pools them into latent groupings of variable size, as illustrated by Figure 1. Crucially, the segmentation must preserve the auto-regressive property: we cannot apply tokenizers directly to incomplete sequences during generation. Rather, we learn a neural boundary predictor during training, with different sources of supervision: 1) tokenizers such as Unigram (Kudo, 2018); 2) spikes in the conditional entropy of the predictive distribution, which ensure that the computation is adaptive to the level of uncertainty of the sequence model.

Figure 1: The architecture of a dynamic-pooling Transformer, which jointly performs language modelling and token segmentation. After passing the input through a first block of layers, it predicts segment boundaries and pools together groups of variable length by averaging. The shortened sequence is processed by a series of intermediate layers, then up-sampled back to its original length via duplication. Finally, the model transforms its representations through a third block of layers and generates the next token x_t.
As an alternative, we explore learning the predictor end-to-end through stochastic re-parameterisation (Maddison et al., 2017;Jang et al., 2017) or using natural data boundaries such as whitespaces, which separate words in many scripts.
To validate our model, we experiment with character-level language modelling on several English benchmarks, including text8 (Mahoney, 2006), CC-100, and wiki40b (Guo et al., 2020), as well as on a series of languages representing different morphological types: Finnish, Hebrew, and Vietnamese. We find that dynamic pooling not only achieves lower time and memory complexity, but even surpasses the performance of vanilla Transformers and fixed-size pooling Transformers in most benchmarks by statistically significant margins.
Overall, our results indicate a promising direction to further accelerate training and therefore facilitate scaling. We release the code at https://github.com/PiotrNawrot/dynamic-pooling.

Language Modelling with Transformers
Let x = (x_1, …, x_l) denote the input sequence. A language model assigns a probability value to any possible sequence of tokens from a vocabulary V. The parameters θ of a model are optimised to maximise the aggregate log-probability of all x ∈ V* in the training set D:

θ* = argmax_θ Σ_{x ∈ D} Σ_t log p_θ(x_t | x_{<t})    (1)

where t indexes time steps. In our experiments, θ corresponds to the parameters of an autoregressive Transformer model (Vaswani et al., 2017).
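As a concrete (toy) illustration, the objective in Equation (1) amounts to minimising the next-token cross-entropy; the tensor shapes below are arbitrary:

```python
import torch
import torch.nn.functional as F

# Toy illustration of Equation (1): maximising sum_t log p(x_t | x_<t)
# is equivalent to minimising next-token cross-entropy.
vocab_size, seq_len = 32, 16
logits = torch.randn(1, seq_len, vocab_size)      # model outputs at each step
tokens = torch.randint(vocab_size, (1, seq_len))  # input sequence x

# Shift: logits at position t predict the token at position t + 1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
assert loss.item() > 0.0  # cross-entropy is non-negative
```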
A key advantage of Transformers is their ability to scale, which ultimately reaps the largest benefits according to Sutton (2019)'s 'bitter lesson' and reveals surprising emergent capabilities of language models (Kaplan et al., 2020; Wei et al., 2022). Nevertheless, the algorithmic complexity of self-attention, O(l²) where l is the length of the sequence, creates a bottleneck. To alleviate this cost, previous work (Clark et al., 2022; Tay et al., 2022; Nawrot et al., 2022) proposed to reduce the sequence length after the initial layers by pooling together groups of tokens. A single shortening by a factor k reduces the complexity to O(l²/k²). This allows for increasing either the model efficiency or its depth within the same compute budget.

Hourglass Transformer
Naïve length reduction through pooling is incompatible with generative models, where output token prediction must happen at the input granularity. For this reason, Nawrot et al. (2022) introduced the Hourglass Transformer, whose architecture is composed of three blocks of Transformer layers, which downsample, process, and upsample the tokens back to the original granularity. The first block encodes each input token x_t into a hidden representation h_t. Afterwards, groups of adjacent tokens of fixed length k are mean-pooled to form l/k representations s:

s_m = (1/k) Σ_{t = k(m−1)+1}^{km} h_t    (2)

Next, each pooled representation s_m is processed by the middle block of Transformer layers, which operates with complexity O(l²/k²), yielding s′_m. This sequence is up-sampled to its original resolution by duplication, u_t = s′_⌈(t−k+1)/k⌉, added to the hidden representations h from before shortening through a skip connection, and passed to the third block.
Note that we subtract k − 1 from the index. This is because pooling and up-sampling in an autoregressive model pose a risk of data leakage from the future to the past: up-sampled representations might encompass future tokens if no measures are taken to prevent this. As a remedy, the Hourglass Transformer shifts the up-sampled sequence to the right by k − 1 positions and pads it with a learnable null-group representation u_0 at the beginning. This is sufficient to satisfy the autoregressive property in the fixed-pooling scenario.

The Hourglass Transformer was shown to improve time and space complexity in a number of language and image modelling tasks, for a given parameter count. However, this came at the expense of degrading the perplexity of the language model, especially with shortening factors k > 2. We conjecture that this undesirable side effect has two main causes. Firstly, the distribution of lengths of natural units of meaning, such as morphemes and phrases, is uneven: for instance, word length is correlated with frequency (Zipf, 1949; Bentz and Ferrer-i Cancho, 2016). Secondly, information content tends to be distributed uniformly across units of meaning (Meister et al., 2021).
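The fixed-size pooling, duplication, and right-shift described above can be sketched as follows (a simplified stand-alone function, not the paper's implementation; `null_group` stands in for the learnable null-group vector u_0):

```python
import torch

def fixed_pool_upsample(h, k, null_group):
    """Fixed-size mean-pooling and leakage-free up-sampling, Hourglass-style.

    h: (batch, l, d) hidden states; l is assumed divisible by k for brevity.
    null_group: (d,) vector standing in for the learnable null group u_0.
    """
    b, l, d = h.shape
    s = h.view(b, l // k, k, d).mean(dim=2)   # mean-pool groups of k tokens
    u = s.repeat_interleave(k, dim=1)         # duplicate back to length l
    # Shift right by k - 1 so position t only sees groups fully in its past,
    # padding the start with the null-group representation.
    pad = null_group.expand(b, k - 1, d)
    return torch.cat([pad, u], dim=1)[:, :l]

h = torch.randn(2, 8, 4)
u = fixed_pool_upsample(h, k=2, null_group=torch.zeros(4))
assert u.shape == h.shape
# The first position only sees the null group, never its own group.
assert torch.equal(u[:, 0], torch.zeros(2, 4))
```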
As a consequence, fixed pooling creates segments with incongruous boundaries and unequal information content. For instance, in speech, this results in giving silence and voice the same importance. Instead, an ideal model should allocate compute conditionally on the information content of a given token. This would also ultimately lead to interpreting language hierarchically, based on the same abstractions that humans adopt for language processing. Hence, we present a method to pool tokens dynamically.

Dynamic-Pooling Transformer

Boundary Prediction
In order to augment the Hourglass architecture with variable-size pooling, we seek to find a sequence of segment boundaries b ∈ {0, 1}^l for every input x. Let b_t = 1 denote a segment boundary between elements x_t and x_{t+1}. The boundary predictor is implemented as a Multi-Layer Perceptron with parameters φ. As shown in Figure 1, this module maps each representation h_t encoded by the first stack of Transformer layers into a Bernoulli probability:

b̂_t = p(b_t = 1 | h_t) = sigmoid(MLP_φ(h_t))    (3)

Since segment boundaries are discrete, sampling from this distribution is not differentiable with respect to the model perplexity. Hence, we optimise this latent variable through stochastic re-parameterisation (Jang et al., 2017; Maddison et al., 2017) via hard Gumbel-sigmoid (Section 3.1.1), jointly learning the language model and the boundary predictor. We favour this solution over a score-function estimator of the gradient, which suffers from high variance and computation costs due to sampling (Schulman et al., 2015).
As an alternative, we explore training the boundary predictor module with a binary cross-entropy loss with respect to two different sources of supervision: a Unigram tokenizer (Section 3.1.2) and spikes in conditional entropy (Section 3.1.3). Finally, we consider resorting to linguistically inspired boundaries (Section 3.1.4). During training and evaluation, we perform maximum-likelihood inference for these variables: each b̂_t from Equation (3) is rounded to the closest binary value, i.e. b_t = 1 if b̂_t > 0.5 and b_t = 0 otherwise.

Segmenting with Gumbel-Sigmoid
Figure 2: Entropy of a Transformer character-level language model in two text segments ('with one of his greatest performances in last tango' and 'such as cpu clock speeds or measures of performance'). Red vertical lines indicate the boundaries according to spikes in conditional entropy. Most of them coincide with whitespaces, due to the high uncertainty at word starts, but they also fall after morphemes like 'great' or 'measure'. Segmentation may vary based on the context, e.g., of the word 'performance'.

In order to learn the input segmentation end-to-end based on the model perplexity, we can re-parameterise the Bernoulli distribution of Equation (3) by injecting stochasticity in this form:

b_t = 𝟙[ sigmoid((log b̂_t − log(1 − b̂_t) + g_1 − g_2) / τ) > 0.5 ],   g_1, g_2 ~ Gumbel(0, 1)    (4)
where τ is the temperature, a hyper-parameter. This estimator, however, is biased and might lead to sub-optimal results. As a consequence, we also propose methods based on supervised learning of the boundary predictor in the following sections.

Segmenting with Subword Tokenizers
The most widespread algorithms for extracting variable-length boundaries for text are subword tokenizers, including Unigram (Kudo, 2018), Byte Pair Encoding (BPE; Sennrich et al., 2016), and WordPiece (Schuster and Nakajima, 2012). Nevertheless, these create subwords greedily. As a result, the segmentation for a given sequence prefix might change after more tokens are observed. For instance, consider the phrase 'civil aviation'. A Unigram model might choose to segment its prefix 'civil aviatio' differently before and after observing the character 'n':

_civil _a vi ati o
_civil _a vi ation
During training, an entire sentence is tokenized at once; during inference, a prefix is extended one character at a time and re-tokenized, possibly changing the boundaries as in the example above. Hence, naïvely deploying off-the-shelf tokenizers during inference does not recover the oracle segments and creates a mismatch between training and evaluation boundaries.
As a remedy, we provide the training tokenization as supervision to our autoregressive boundary predictor instead. More specifically, we employ a Unigram tokenizer (Kudo, 2018), as it aligns with morphological units better than other algorithms (Bostrom and Durrett, 2020). To prevent subword units from crossing word boundaries, we split the text on whitespace characters beforehand. Vocabulary size is a tunable hyper-parameter which offers different efficiency-performance trade-offs.
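For illustration, the supervision targets for the boundary predictor can be derived from a training-time tokenization by placing a boundary after the last character of each subword (a hypothetical helper, not the paper's code):

```python
def boundaries_from_tokens(tokens):
    """Convert a tokenization into binary boundary targets.

    A boundary b_t = 1 is placed after the last character of each subword,
    yielding supervision for the autoregressive boundary predictor.
    """
    b = []
    for tok in tokens:
        b += [0] * (len(tok) - 1) + [1]
    return b

# A Unigram-style segmentation of 'civil aviation' (whitespace pre-split):
tokens = ["civil", "a", "vi", "ation"]
assert boundaries_from_tokens(tokens) == [0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1]
```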

Segmenting with Entropy Spikes
As an alternative to providing supervision through Unigram, we also propose a new segmentation method based on spikes of conditional entropy, which is agnostic about the presence of natural boundaries (such as whitespaces) or the availability of tokenizers. These properties make it suitable for other modalities in addition to text, such as speech and vision. Moreover, this enables top-down supervision and end-to-end training without external dependencies.
Intuitively, in natural language the information content tends to be spread evenly throughout a sentence, to facilitate communication. The conditional entropy is the expectation of the information content −log p over the tokens in the vocabulary:

H_t = −Σ_{v ∈ V} p(x_{t+1} = v | x_{≤t}) log p(x_{t+1} = v | x_{≤t})    (5)

Therefore, peaks in this conditional entropy provide indications of surprisal, and can serve as natural boundaries between segments. More formally, let H_t be the conditional entropy at time t. We select local spikes by comparing their value within a (left) window of size k, placing boundaries according to the following condition:

b_t = 1 if H_t > H_{t−i} for all i ∈ {1, …, k}, and b_t = 0 otherwise    (6)

Empirically, entropy spikes in language models overlap with word boundaries to a significant degree (Hutchens and Alder, 1998). However, they are also more flexible, as they enable conditional computation based on the model's confidence about its next-token prediction. For an example of segmentation based on entropy spikes, see Figure 2.
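A sketch of this spike-detection rule, assuming a left window of size k and next-token logits as input (an illustrative helper, not the paper's implementation):

```python
import torch

def entropy_spike_boundaries(logits, k=2):
    """Place boundaries where conditional entropy spikes above a left window.

    logits: (l, vocab) next-token logits; returns a 0/1 boundary list.
    Sketch of the rule in Equation (6): b_t = 1 iff H_t > H_{t-i} for i = 1..k.
    """
    probs = torch.softmax(logits, dim=-1)
    H = -(probs * torch.log(probs)).sum(dim=-1)  # conditional entropy per step
    b = []
    for t in range(len(H)):
        window = H[max(0, t - k):t]
        b.append(1 if len(window) > 0 and (H[t] > window).all() else 0)
    return b

H_logits = torch.tensor([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0], [0.0, 0.0]])
# Entropies: high, lower, lowest, high again -> only the last step spikes.
assert entropy_spike_boundaries(H_logits, k=2) == [0, 0, 0, 1]
```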

Linguistically Inspired Segments
Finally, perhaps the most straightforward source of segmentation is word boundaries: in many scripts, these are marked by whitespace characters. The simplicity of this method comes with the obvious drawback of not providing control over the rate of shortening, while we found that the optimal rate varies with the language. Hence, its efficiency-performance trade-off is not tunable.
Segment boundaries are placed in between two symbols. In our experiments, we put a boundary after a whitespace character. Thus, we do not need to train a boundary predictor: predicting a whitespace character signals that the current group should be closed in the next iteration of auto-regressive generation. This would not be possible had we chosen to put a boundary before a whitespace character.
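This convention (a boundary after, never before, a whitespace) is trivial to implement:

```python
def whitespace_boundaries(text):
    """Place a boundary after each whitespace character (never before it),
    so that emitting a whitespace during autoregressive generation is itself
    the signal that the current group is closed."""
    return [1 if c == " " else 0 for c in text]

b = whitespace_boundaries("a cat sat")
assert b == [0, 1, 0, 0, 0, 1, 0, 0, 0]
```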

Pooling and Up-sampling
After generating the sequence of boundaries b, we pool the tokens belonging to the same segment by averaging. Thus, we form Σ_{t=1}^{l} b_t + 1 shortened representations s, which are then passed to the middle block of Transformer layers. Note that for Gumbel-sigmoid, to keep pooling differentiable, we algebraically manipulate b ∈ ℝ^l into B ∈ ℝ^{l × (1 + Σ_t b_t)}, i.e. a binary matrix that maps from the original length to the shortened length, following Bhati et al. (2021). The cell B_ij is 1 if token i is merged into the j-th group, and 0 otherwise. Thus, s = hB / Σ_i B_i, where the denominator unit-normalises the matrix columns.
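The boundary-matrix construction and column-normalised pooling can be sketched as follows (an illustrative helper; here a boundary b_t = 1 closes the segment ending at token t):

```python
import torch

def mean_pool_segments(h, b):
    """Mean-pool hidden states into variable-size segments.

    h: (l, d) hidden states; b: (l,) binary boundaries, where b_t = 1 marks a
    boundary between tokens t and t + 1. Builds the binary matrix B mapping
    tokens to groups, then computes s = B^T h normalised by group sizes.
    """
    l, d = h.shape
    # Group index of each token: 0 until the first boundary, then cumulative.
    group = torch.zeros(l, dtype=torch.long)
    group[1:] = torch.cumsum(b, dim=0)[:-1]
    n_groups = int(group[-1]) + 1
    B = torch.zeros(l, n_groups)
    B[torch.arange(l), group] = 1.0
    counts = B.sum(dim=0, keepdim=True)   # tokens per group
    return B.T @ h / counts.T             # (n_groups, d)

h = torch.arange(8.0).view(4, 2)          # 4 tokens, d = 2
b = torch.tensor([0, 1, 0, 0])            # segments {0, 1} and {2, 3}
s = mean_pool_segments(h, b)              # sum(b) + 1 = 2 groups, as in the text
assert torch.equal(s, torch.tensor([[1.0, 2.0], [5.0, 6.0]]))
```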
To obtain the up-sampled representation u_t while preserving the autoregressive property, we calculate the largest index m such that the output of the middle block s_m does not include future information: u_t = s_m, where m = Σ_{i=1}^{t} b_i. As a consequence, a segment representation s_m can only be added to the last token pooled into group m; for all the other, non-final tokens, we take the representation of the previous segment s_{m−1}. Similar to Hourglass, the representation for the first (null) group s_0 is a learnable vector. Afterwards, u_t is added to the hidden representation h_t through the skip connection.
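The index computation m = Σ_{i≤t} b_i reduces to a cumulative sum, so leakage-free up-sampling can be sketched as:

```python
import torch

def upsample_autoregressive(s, b, null_group):
    """Up-sample segment representations without leaking future information.

    Token t receives segment m = sum_{i<=t} b_i, so only the token that closes
    a segment (b_t = 1) sees that segment's representation; earlier tokens see
    the previous one, and tokens before any boundary see the null group s_0.
    """
    m = torch.cumsum(b, dim=0)                      # (l,)
    s_with_null = torch.cat([null_group[None], s])  # prepend the null group
    return s_with_null[m]                           # (l, d)

s = torch.tensor([[1.0], [2.0]])   # two segment vectors
b = torch.tensor([0, 1, 0, 1])     # boundaries close segments at t = 1 and 3
u = upsample_autoregressive(s, b, null_group=torch.zeros(1))
# Token 0 sees the null group; token 1 closes segment 1 and sees it;
# token 2 still sees segment 1; token 3 closes segment 2 and sees it.
assert torch.equal(u, torch.tensor([[0.0], [1.0], [1.0], [2.0]]))
```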

Auxiliary objectives
In addition to minimising the language modelling loss with respect to the parameters θ as shown in Equation (1), we use auxiliary objectives to train the boundary predictor parameters φ. For supervised learning with subword tokenizers and entropy spikes, we minimise the cross-entropy between predicted boundaries b̂ and gold ones. For end-to-end learning with Gumbel-sigmoid, we introduce a regularizer based on a Binomial prior. Let k = Σ_t b_t:

L_φ = −log Binomial(k; l, α) = −log [ (l choose k) α^k (1 − α)^{l−k} ]    (7)

where α ∈ [0, 1] is a hyper-parameter. This regularizer prevents the model from collapsing into trivially predicting each position as a boundary.
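A sketch of this Binomial regularizer, using the log-gamma function for the binomial coefficient (illustrative, not necessarily the paper's exact formulation):

```python
import torch

def binomial_prior_loss(b_soft, alpha):
    """Negative log-likelihood of the boundary count under a Binomial prior.

    b_soft: (l,) soft boundary probabilities from the Gumbel-sigmoid;
    alpha: prior probability of a boundary at any position.
    Discourages the trivial solution of predicting every position as a boundary.
    """
    l = b_soft.numel()
    k = b_soft.sum()  # (expected) number of boundaries; may be non-integer
    # log C(l, k) via log-gamma, which also handles fractional k.
    log_binom = (
        torch.lgamma(torch.tensor(l + 1.0))
        - torch.lgamma(k + 1)
        - torch.lgamma(l - k + 1)
    )
    alpha = torch.tensor(alpha)
    return -(log_binom + k * torch.log(alpha) + (l - k) * torch.log(1 - alpha))

# The loss is lowest when the boundary rate k / l matches the prior alpha.
near = binomial_prior_loss(torch.full((100,), 0.2), alpha=0.2)
far = binomial_prior_loss(torch.full((100,), 0.9), alpha=0.2)
assert near < far
```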
Experimental Setup

Datasets
In addition to English, we evaluate our model on data in three languages, which represent different morphological types: Finnish for agglutinative, Hebrew for introflexive, and Vietnamese for isolating. Thus, we ensure that dynamic pooling is robust to different word-length distributions. For English, we use text8 (Mahoney, 2006), CC-100, and wiki40b (Guo et al., 2020), as they are established benchmarks for character-level language models. For the other languages, we use the corresponding subsets of wiki40b. To make results comparable across languages and prevent data imbalance, we limit the size of CC-100 and wiki40b to the first 400M tokens of the training set and the first 2M tokens of the validation set. We retain the original splits for each dataset. For all datasets and languages, we follow the same pre-processing steps used by Mahoney (2006) to create text8. Specifically, for each language we keep only the characters from its script, as well as whitespace and end-of-line characters. The text is lowercased, and digits are spelt out in the target language. For wiki40b, we also remove special structural markers and normalise homoglyphs. This way, we filter out excerpts in different languages, which are known to contaminate noisy multilingual texts (Kreutzer et al., 2022). The pre-processing scripts can be found as part of our code.

Hyper-parameters
All of our experiments, except for the scaling ablation, use 12-layer Hourglass Transformers with 2 layers in the first block, 8 layers in the second block (which operates on shortened sequences), and 2 layers in the final block, following Nawrot et al. (2022). For every Transformer layer, the hidden dimension is 512 and the intermediate feed-forward dimension is 2048. Self-attention is split into 8 heads. We use a post-norm architecture, the GELU activation function (Hendrycks and Gimpel, 2016) in feed-forward layers, and the relative attention parametrisation from Transformer-XL (Dai et al., 2019). In total, the model has ~41M parameters.
The boundary predictor is a 2-layer MLP that takes a hidden state as input and outputs a scalar at every time step. For models with dynamic pooling, this module adds around 1M additional parameters. We use the SentencePiece library (Kudo and Richardson, 2018) to train Unigram segmentation for every dataset separately. We detect spikes in conditional entropy within a window of size k = 2, selected from the range k = 1…4 for optimal BPC on text8. For Gumbel-sigmoid, we set the prior probability of a boundary α to 0.2 for English, Vietnamese, and Hebrew, and to 0.37 for Finnish. The Gumbel temperature parameter was set to 0.5 in all experiments. For the Unigram vocabulary size, we set |V| = 10000 for English and Vietnamese, and |V| = 200 for Finnish and Hebrew.
Following Dai et al. (2019), we train for 2·10^5 steps with a batch size of 8 and a learning rate of 2.5·10^−4 on 2x NVIDIA RTX 3080 GPUs. Each training run took approximately 12h to 30h, depending on the configuration. We use a linear warm-up schedule for the first 4k steps, followed by a single-cycle cosine scheduler. We use the Adam optimiser with β_1 = 0.9, β_2 = 0.999, and ε = 10^−8, and clip the gradients at 0.25. We apply a 0.1 dropout rate to the attention matrix and feed-forward layers. Before every epoch, we cyclically shift the text stream, divide it into non-overlapping chunks of 2048 characters, and shuffle them. During evaluation, to provide context to the model, we split the test set into partially overlapping sequences of size l = 2048 with a step size of 512 and calculate the model perplexity only over the last 512 tokens.
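The overlapping evaluation scheme can be sketched as follows (a hypothetical helper; the very first tokens of the stream would need separate handling):

```python
def eval_windows(n_tokens, window=2048, step=512):
    """Overlapping evaluation chunks: score only the last `step` tokens of
    each window, so every scored token sees up to `window` tokens of context.
    Returns (start, end, score_from) index triples over a token stream."""
    spans = []
    start = 0
    while start + window <= n_tokens:
        spans.append((start, start + window, start + window - step))
        start += step
    return spans

spans = eval_windows(3072)
assert spans == [(0, 2048, 1536), (512, 2560, 2048), (1024, 3072, 2560)]
```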

Results
The results for the experiments on character-level language modelling are shown in Table 1. In addition to the four proposed segmentation methods, we include a vanilla Transformer and fixed-size pooling Transformers with multiple shortening factors as baselines. Every model is evaluated with respect to two metrics: bits per character (BPC; ↓) and shortening factor (SF; ↑). The former measures the negative log-probability of the language model predictions, and thus its quality; the latter measures the average reduction of the sequence length in intermediate layers, and thus the model efficiency. Figure 5 in Appendix B shows how SF translates to better training time and memory consumption, as measured on a GPU with an optimised implementation.
Segmentation Methods In all the English evaluation benchmarks (text8, wiki40b, and CC-100), both whitespace-based and Unigram-based segmentations achieve the lowest BPC, outperforming both vanilla and fixed-pooling Transformers by statistically significant margins. Moreover, the same two methods achieve the highest degrees of shortening. Note that for equivalent SFs, fixed-size pooling becomes detrimental to performance. The approaches based on entropy spikes and Gumbel-sigmoid are generally inferior to the alternatives for dynamic pooling. However, for comparable shortening factors, they always outperform vanilla and fixed-pooling Hourglass models. Moreover, they make the fewest assumptions about the data and the availability of external supervision, so they might be applicable to other domains (such as speech and vision) in future work. In general, providing a Transformer with the correct inductive bias for pooling variable-size segments not only facilitates scaling, but also enhances prediction quality.
Notably, the gains from whitespace segmentation are not identical in all languages, due to inherent differences in morphological type and average word length. However, this method still yields consistently superior SFs cross-lingually, ranging from 4.4× in isolating Vietnamese to 7.9× in agglutinative Finnish, whereas mildly fusional English and introflexive Hebrew lie in between with 5.5× and 5.7×, respectively. These translate into higher training speed, from 1.7× in Finnish to over 2.5× in English, while simultaneously lowering BPC. On the other hand, it is Unigram that achieves the lowest BPCs in Finnish and Hebrew. Hence, the gains from dynamic pooling are robust cross-lingually, but the optimal segmentation method may vary.

Table 1: Language modelling results on 3 English datasets and 3 other morphologically diverse languages. For each pair of method and dataset, we report test BPC (↓) and average shortening factor (SF; ↑). We run each experiment 3 times with different random seeds. We mark with a symbol the experiments that are statistically better than both the vanilla baseline and fixed shortening, by means of a paired Student's t-test with p < 0.05. We report the best hyper-parameter configuration for each language.

Figure 3: BPC (↓) vs. shortening factor (SF; ↑). The higher the SF, the more efficient the model (cf. Figure 5 in the Appendix). SF increases with the vocabulary size (Unigram) or the prior boundary probability (Gumbel). Unigram dynamic pooling shifts the Pareto front, i.e., improves both efficiency and accuracy. Note that fixed pooling at k = 1 corresponds to the vanilla Transformer model.

The Unigram vocabulary size and the prior α in Gumbel-sigmoid provide easily controllable knobs to study the trade-off between efficiency and performance: as they change, so does the shortening factor. In Figure 3, we plot BPC and SF for six vocabulary sizes (200, 500, 1k, 3k, 5k, 10k) and five α values (0.20, 0.25, 0.30, 0.37, 0.45), and compare them with fixed-size pooling in Hourglass Transformers.
Manifestly, dynamic pooling enhances the Pareto front by finding more optimal trade-offs between efficiency and performance. Moreover, while fixed pooling follows a similar trend cross-lingually, dynamic pooling behaves more idiosyncratically: e.g., BPC in Vietnamese surprisingly improves with higher SFs.

Time and Space Complexity
To showcase the efficiency advantages resulting from higher SFs more concretely, we present measurements of memory consumption and training time of our PyTorch implementation, run on a typical GPU with full float precision (Figure 5). With a shortening factor of 2, the model reduces memory consumption and training time by 43% and 44%, respectively, compared to a vanilla Transformer. Increasing the shortening factor to 4, where the dynamic-pooling Hourglass still achieves superior BPC scores, reduces memory consumption and training time by 53% and 60%, respectively. This allows us to increase the size of the models or fit them on hardware with smaller compute budgets, while simultaneously benefiting performance.
Scaling the Model We investigate if the performance of dynamic-pooling Transformers scales well in terms of model size, by adding more layers in the middle block. We focus on this block as it increases the model depth (and hence its capacity) while retaining the highest efficiency as it operates on shortened sequences. We present this ablation in Figure 4. We find that the gains from dynamic pooling are consistent across all numbers of layers. Extrapolating from the trends, dynamic pooling holds promise to continue providing benefits even in extremely large language models.
Other Efficient Transformer Models Finally, we remark that our method differs from efficient Transformer algorithms which reduce the quadratic complexity of attention (Child et al., 2019; Lee-Thorp et al., 2022; Choromanski et al., 2021; Wang et al., 2020), as it focuses on length reduction; combining both strategies may yield further gains. Moreover, those efficient variants tend to trade quality for efficiency, whereas we have shown that the dynamic-pooling mechanism improves both simultaneously in our experiments.

Related Work
Dynamic RNNs Our approach is inspired by variants of RNNs that process sequences at varying time scales by introducing a hierarchy of hidden units. For instance, some RNNs mimic speed-reading through hidden units that can skip over input elements (Campos et al., 2018; Seo et al., 2018). Similarly, Chung et al. (2017) discover the latent hierarchy of an input sequence using a stack of LSTMs. Each layer is equipped with a binary gate responsible for hard boundary detection, where lower-level boundaries determine the state updates made by higher-level layers. Whenever the detector ends a segment, its representation is fed to the upper layer.
Slow- and fast-changing hidden units were already described by Hihi and Bengio (1995). Similarly, the Clockwork RNN (Koutnik et al., 2014) introduces a hierarchy of hidden-state units that make transitions at a set of different, fixed frequencies. Adaptive Computation Time networks perform a different amount of computation on each item in the sequence (Graves, 2016). Both ideas were combined in Fast-Slow RNNs (Mujika et al., 2017), which can choose between heavy and lightweight transition functions between timesteps.
Pooling Transformer models While pooling blocks in Transformers are related to slowly varying units in RNNs, their operation is different. RNNs suffer from unreliable transport of information across long time spans; units that act like skip-connections over time can help them carry information (Krueger et al., 2017). In a Transformer network, a unit at time t can directly communicate with any other unit, including previous ones, and we find it important to confirm the benefits of dynamic pooling in Transformer models.
Perhaps the most similar approach to ours is the Funnel Transformer, which uses a similar, hourglass-shaped Transformer architecture. After passing through the first block, the data is pooled at a fixed rate, processed by the deep middle Transformer block, and up-sampled for the last block. Canine (Clark et al., 2022) has a similar three-part architecture and processes Unicode inputs, which are downsampled with Transformer and convolution layers. Tay et al. (2022) implement gradient-based subword tokenization within a Transformer model, which learns dynamic groupings of tokens into fixed-size groups. In Segatron (Bai et al., 2021), externally sourced sentence and paragraph boundaries of variable length are used as additional conditioning for the model.

Boundary detection We investigate boundaries provided by an external model, derived directly from the data, or obtained top-down from the model's entropy. Kreuk et al. (2020) show a bottom-up approach to the phoneme segmentation task, combining contrastive learning (van den Oord et al., 2019) with a method for boundary detection based on the dissimilarity between subsequent frames. It was later extended by Bhati et al. (2021) to segment the sequence of speech frames dynamically. Recently, Cuervo et al. (2022) introduced a hierarchical sequence processing model in which units in the upper layer operate on a dynamically shortened sequence, with the shortening guided by a boundary prediction model. Rocki et al. (2016) control the activity of LSTM gates with the model's output cross-entropy, and Alpay et al. (2019) used a similar mechanism based on information content to guide the copying of individual activations in an LSTM network. Similarly, we employ the entropy of model predictions to choose where to insert boundaries.

Conclusions
We proposed a new family of language models that pool variable-size segments of tokens in the intermediate layers in order to enhance the efficiency and performance of the Transformer architecture. In particular, we learn a boundary predictor either end-to-end through stochastic re-parameterisation, through supervision (obtained from subword tokenization or spikes in the conditional entropy), or based on linguistic boundaries such as words. We evaluate this model extensively on multiple language modelling benchmarks in English and in other typologically diverse languages: Finnish, Hebrew, and Vietnamese. Compared to vanilla Transformers and fixed pooling, we observe a significant decrease in model perplexity as well as time and space complexity. This opens up the perspective to develop Transformer models capable of computing language both hierarchically, with the same abstractions humans perform at different levels of linguistic structure, and conditionally on the information content of each segment.
In the future, our dynamic-pooling Transformer can be combined with methods relying on external memory, encoders operating at a fine resolution (Xue et al., 2022; Tay et al., 2022), and more generally any task with long-context inputs (Shaham et al., 2022).

Limitations
Linguistic Variation The results we obtain are highly dependent on the target language and its morphology. For example, word boundaries might seem like an obvious first choice for dynamic segmentation, and in fact they achieve the best performance in English and Vietnamese. On the other hand, for some languages, like agglutinative Finnish, whitespaces yield shortening rates so high that they are detrimental to model performance. Furthermore, explicit word boundaries are not always available in the data: for example, in Chinese script, or in modalities other than text like speech and vision, there is no obvious equivalent of whitespace. However, segmentation based on stochastic re-parameterisation, subword tokenizers, and spikes in conditional entropy overcomes these limitations.
Contiguous segments In its current formulation, dynamic pooling only allows for merging contiguous segments of tokens in a sequence. However, this is not ideal for morphology types like Hebrew where morphemes are discontinuous: vowels are interspersed between consonant roots for inflection. Moreover, future works should consider higher levels of linguistic structure than words, such as dependency trees, for pooling. In this case, discontinuous segments may be necessary to handle non-projective syntactic dependencies.

Independent boundary decisions
The decision to emit a boundary at time step t depends on previous boundaries only indirectly through the hidden representation of the first Transformer block, as this preserves the efficiency of the boundary predictor. Instead, a recurrent model could be explicitly conditioned on previous boundary decisions, which however would negatively affect the time complexity of the language model.

Work contribution of authors
The idea of training the models with pooling of variable-length segments was discussed among the authors while Jan Chorowski was at the University of Wrocław. Experiments were performed by Piotr Nawrot while he was employed in a research grant at the University of Wrocław, under the supervision of Adrian Łańcucki and Edoardo M. Ponti. The manuscript was written by Piotr Nawrot, Adrian Łańcucki and Edoardo M. Ponti.
As an ablation, we also compare different methods to represent groups of tokens when shortening the input sequence length: average pooling, reported in our experiments, and sub-sampling, i.e., selecting only the last token as a representative for each group. As it emerges from

B Shortening benefits
We quantify the reduction in GPU memory and training time that results from different shortening factors (SFs) in Figure 5. We measure these metrics on an NVIDIA GV100 32GB GPU with our text8 models. Results apply to dynamic-pooling (Gumbel, Whitespace, Unigram, and Entropy), fixed-pooling, and vanilla Transformers (only for SF = 1). At an SF of 4-6×, less than 50% of the memory is used and training is 2.5× faster.